ABSTRACT

Exploratory  data  analysis  problems  have  recently  grown  in  importance  due  to 
the  large  magnitudes  of  data  being  collected  by  everything  from  satellites  to  supermarket 
scanners.  This  so-called  "data  glut"  often  precludes  the  effective  processing  of 
information  for  decision-making.  These  problems  can  be  seen  as  search  problems  over 
massive  unstructured  spaces.  A  prototypical  problem  of  this  type  involves  the  search,  by 
Department  of  Defense  medical  agencies,  for  a  so-called  "Desert  Storm  Syndrome" 
which  involves  large  amounts  of  medical  data  obtained  over  several  years  following  the 
Persian  Gulf  conflict.  This  data  ranges  over  more  than  170  attributes,  making  the  search 
problem  over  the  attribute  space  a  hard  one.  We  propose  the  use  of  genetic  algorithms  for 
the  attribute  search  problem,  and  intertwine  it  with  search  algorithms  at  the  detailed  data 
level.  Computational  results  so  far  strongly  suggest  that  our  system  has  succeeded  at  the 
given tasks, requiring relatively few resources. The results also give no indication that a
single  syndrome  or  other  medical  entity  is  responsible  for  wide-spread  adverse  health 
ramifications  among  a  significant  cross-section  of  Persian  Gulf  War  participants  in  the 
CCEP  program.  There  are,  however,  numerous  correlations  of  exposure/demographic 
information  and  associated  symptoms/diagnoses  which  suggest  that  smaller  groups  may 
share  common  health  conditions  based  on  shared  exposure  to  common  health  risk  factors. 
I.  INTRODUCTION


A.       ANALYSIS  OF  LARGE  DATABASES 

Twenty  years  ago,  computers  were  relatively  scarce  and  applied  to  limited,  highly 
specialized  applications.  At  that  time,  there  were  rarely  enough  computerized  data  to  make  them 
an  integral  part  of  any  organization's  decision-making  process.  As  technology  approached  the 
present  day,  automated  information  systems  became  more  capable  and  more  involved  in  daily 
life.  They  began  capturing  more  and  more  data,  allowing  the  computer  to  become  an  active 
participant  in  expanding  facets  of  daily  decision-making.  The  exponentially  increasing  volume  of 
available  data  has  transformed  the  decision  challenge  from  one  of  "data  starvation"  to  "data 
saturation." Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Fayyad et al., 1996, pp. xv-xvi)
attribute this "mountain of stored data" to such factors as advances in scientific data
collection, introduction of bar codes, and the computerization of many business and government
transactions. In many situations today, there is so much data that human beings are unable to
correlate it all, and decision quality is again hampered, or in the words of John Naisbitt (Fayyad
et al., 1996, p. xv), "We are drowning in information, but starving for knowledge."

Clearly there is a growing need for "intelligent agents," or automated information
systems that can sift through these mountains of data (which other systems have efficiently
collected) and integrate these sources into concise, usable knowledge for use in human
decision-making. It is doubtful that a computer can reproduce the innovative creativity of a human
analyst, but a computer system can be given a basic representation of some of what the
human analyst desires. This representation of interest is then used to filter vast volumes of
available data (a task too time-consuming for humans) and present the human analyst with a
more concise body of knowledge in an understandable form. This premise is widely supported
in the literature, as in this quote from Fayyad et al.:

Such  volumes  of  data  clearly  overwhelm  the  traditional  manual  methods  of  data 
analysis  such  as  spreadsheets  and  ad-hoc  queries.  Those  methods  can  create 
informative  reports  from  data,  but  cannot  analyze  the  contents  of  those  reports 
to  focus  on  important  knowledge.  A  significant  need  exists  for  a  new  generation 
of techniques and tools with the ability to intelligently and automatically assist
humans in analyzing the mountains of data for nuggets of useful knowledge.
These techniques and tools are the subject of the emerging field of knowledge
discovery in databases (KDD). (Fayyad et al., 1996, p. 2)

The  Comprehensive  Clinical  Evaluation  Program  (CCEP)  database  presents  this  type  of 
challenge  to  data  analysis.  The  CCEP  database  contains  vast  amounts  of  information  on  over 
19,000  Persian  Gulf  War  (PGW)  veterans  who  have  brought  some  form  of  health  concern  to  the 
attention  of  the  Department  of  Defense  (DoD)  military  healthcare  system.  The  database  contains 
a  large  number  of  attributes,  and  there  are  still  no  defined  parameters  for  search.    In  any  case, 
because  of  problem  structure  and  sheer  size,  the  entire  database  cannot  be  comprehensively 
analyzed  by  conventional  means.  The  goal  of  this  thesis  is  to  design,  construct,  and  implement 
an  artificially  intelligent  computer  system  which  can  analyze  the  CCEP  database  more  efficiently 
than  a  conventional  or  "brute  force"  approach  without  unduly  taxing  scarce  medical  research 
assets.  Such  computer  systems  are  said  to  carry  out  "data  mining." 


B.       PURPOSE  OF  THIS  RESEARCH 

The ultimate purpose of this research is to provide the CCEP program with a viable
methodology  to  obtain  useful  information  from  its  database  of  participating  PGW  veterans. 
Determining  what  constitutes  "useful"  or  "interesting"  information  is  at  least  as  great  a  challenge 
as  devising  an  analysis  tool.  However,  in  the  initial  stages  of  medical  research,  interesting 
information  is  any  statistical  association  between  database  attributes  of  different  categorical 
groups.  These  associations  may  signal  the  existence  of  an  undiscovered  common  ailment  or 
"syndrome"  affecting  participants  in  the  Persian  Gulf  War. 

Time  and  other  resources  are  also  key  factors  in  the  overall  CCEP  research  project. 
Simply investigating every possible combination of attributes may be theoretically feasible, but
in practice it often necessitates an impractically large commitment of resources to the analysis
task. Therefore, investigative speed and efficiency have become key factors in this research. The
need for speed and efficiency demands that this research develop an intelligent search device
capable of sifting through vast amounts of raw data and identifying interesting trends or
correlations without the need for human intervention. Consequently, a genetic algorithm has
been selected. No commercial product suited our particular needs, so the purpose of this research
includes  the  development  and  application  of  a  genetic  algorithm  suited  to  analysis  of  medical 
data,  specifically  the  CCEP  database. 

Finally,  this  research  evaluated  the  success  of  the  new  genetic  algorithm  (DaMI,  the  NPS 
Data  Miner)  from  several  aspects: 

•  DaMI  performance  adheres  to  classical  genetic  algorithm  theory 

•  DaMI  statistical  computations  are  valid  and  reproducible 

•  DaMI  efficiently  and  comprehensively  analyzes  the  search  space 

•  Outcome  hypotheses  are  of  significant  value  to  medical  experts  and  the  program 
sponsor 

As with problem structuring, validation of results has proven to be a major research challenge
and  is  addressed  in  this  paper. 

Computational  results  so  far  strongly  suggest  that  our  system  has  succeeded  at  the  given 
tasks, requiring relatively few resources. The results also give no indication that a single
syndrome  or  other  medical  entity  is  responsible  for  wide-spread  adverse  health  ramifications 
among  a  significant  cross-section  of  Persian  Gulf  War  participants  in  the  CCEP  program.  There 
are,  however,  numerous  correlations  of  exposure/demographic  information  and  associated 
symptoms/diagnoses  which  suggest  that  smaller  groups  may  share  common  health  conditions 
based  on  shared  exposure  to  common  health  risk  factors. 

C.       SCOPE  OF  RESEARCH 

This  research  examines  the  problem  structuring  challenges  for  analyzing  the  data 
contained  in  the  CCEP  database.  It  discusses  the  general  qualities  of  genetic  algorithms  and  the 
specific  techniques  used  to  apply  a  genetic  algorithm  to  the  study  of  the  CCEP  database.  The 
research  focuses  on  application  of  a  genetic  algorithm  to  a  relevant  real-world  problem  and  does 
not  contain  an  in-depth  description  of  genetic  algorithm  theory.  An  original  genetic  algorithm 
(DaMI) was created by this research effort. A technical description of the DaMI algorithm, its
development process, and its evaluation methodology is included. It is not the purpose of this
research to survey all possible solutions to the CCEP analysis challenge, but rather to completely
examine  and  document  one  apparently  successful  solution.  Finally,  the  results  of  the  DaMI 
analysis  of  the  CCEP  database  are  presented  along  with  the  validation  process  and 
recommendations  for  further  research.  The  following  research  questions  were  addressed: 

•  If there is a common ailment or "syndrome" (there may be more than one)
afflicting veterans of the Persian Gulf War, how will it manifest itself within the
scope of information gathered by the CCEP database?

•  How  will  the  subjective  concept  of  interesting  information  (to  the  medical 
community)  be  quantitatively  measured  and  used  to  compare  the  "fitness"  of 
different  hypotheses? 

•  How  should  the  research  problem  and  database  be  structured  to  facilitate  automated 
analysis? 

•  Why  is  a  genetic  algorithm  a  more  effective  means  of  analyzing  the  CCEP  search 
space  than  other  more  conventional  methods? 

•  How  was  DaMI  constructed?  What  were  the  design  considerations  and  key 
innovations  in  this  particular  genetic  algorithm? 

•  What  analyses  were  conducted  and  what  were  the  results? 

•  Were  the  results  validated  and  were  they  useful  to  the  project  sponsor  (CCEP, 
Deployment  Surveillance  Team)  and  CCEP  medical  researchers? 


D.       REAL  WORLD  APPLICABILITY 

A  great  deal  of  research  has  been  performed  on  genetic  algorithms  and  related  artificial 
intelligence-based research tools. In many cases the data analyzed were real, but in few cases was the
research tied to a real-world, time-sensitive research problem. One of the primary reasons
for using a genetic algorithm is that an answer is needed, but conventional research resources are
not available to produce that answer within the allotted time. This makes a study of real-world
genetic algorithm development all the more interesting. The CCEP database research is highly
visible, relevant, and time-sensitive.


Only  a  select  number  of  medical  issues  have  received  as  much  attention  as  the  proverbial 
"Desert  Storm  Syndrome"  in  recent  years.  Since  the  first  returning  Persian  Gulf  War  (PGW) 
veterans  began  reporting  health  issues,  this  subject  has  received  constant  attention  by  the  U.S. 
government,  military  medical  researchers,  and  most  prolifically  the  media.  A  Presidential 
commission  has  been  appointed  to  determine  what,  if  any,  health  ailments  may  be  attributed  to 
the  service  of  U.S.  armed  forces  in  the  Persian  Gulf.  Research  efforts  continue  at  many  DoD  and 
Veterans  Administration  (VA)  facilities.  It  is  certainly  appropriate  to  say  that  the  CCEP  is  "high 
visibility." 

Similarly,  the  concept  of  relating  diseases  to  groups  of  humans  with  similar  symptoms 
and  life  experiences  (demographics  and  exposure  to  physical  objects)  has  been  a  focus  of  medical 
research  for  many  years.  Some  of  the  earliest  genetic  algorithm  experiments  attempted  to  relate 
symptoms  to  diagnoses.  Medical  science  has  consistently  searched  for  better  ways  to  answer  the 
question,  "What  caused  this  disease?"  In  the  case  of  CCEP,  697,000  veterans  (not  to  mention 
their  families)  are  eager  to  know  if  their  service  in  the  PGW  increases  their  susceptibility  to  any 
type  of  medical  malady.  From  an  academic  perspective,  the  issue  of  automatically  identifying 
"interesting"  information  has  become  increasingly  fascinating  and  challenging.  Technology  has 
increased  researchers'  ability  to  automate  aspects  of  a  medical  situation,  but  the  problem  of 
making  a  model  that  accurately  reflects  the  information  remains. 

E.       THESIS  METHODOLOGY  AND  ORGANIZATION 

This  research  begins  with  examination  of  the  CCEP  research  challenge  as  a  whole.  The 
first  challenge  is  to  structure  the  CCEP  research  question  of  what  is  an  "interesting"  hypothesis 
into  a  mathematical  formula  (fitness  function).  This  in  turn  returns  a  higher  "fitness"  to 
hypotheses  of  greater  interest  to  CCEP  medical  researchers.    Our  research  tried  many 
alternatives,  but  settled  on  the  use  of  the  Modified  J-measure  (described  in  section  H.E.4.c)  to 
assess  relative  independence  between  premise  and  outcome  variables.  The  CCEP  database  was 
not  designed  with  medical  research  in  mind,  so  the  second  challenge  was  to  reformat  the  database 
into  a  structure  which  supported  automated  analysis. 


Once  the  problem  and  source  database  were  structured  appropriately,  a  suitable  research 
tool  was  needed.  It  was  clear  that  using  a  "brute  force"  approach  to  examine  the  CCEP  database, 
even  using  computer  simulation,  was  impractical  because  of  the  tremendous  size  of  the  search 
space.  A  genetic  algorithm  was  chosen  because  of  the  innate  ability  of  genetic  algorithms  to 
inductively  adapt  to  the  researcher's  goals  and  to  intelligently  analyze  a  search  space,  bypassing 
hypotheses  which  show  little  chance  of  future  success.  Our  concept  enhanced  the  conventional 
genetic algorithm approach by dividing the process into two modules: a genetic operator, which
handles  selection  and  recombination  of  hypotheses  at  the  field  level  only,  and  a  statistical 
package,  which  analyzes  every  possible  combination  of  hypothesis  fields  passed  from  the  genetic 
operator  and  returns  an  integrated  fitness  measure  for  the  entire  hypothesis.  Additionally,  our 
tool  examines  multiple  independent  and  dependent  (LHS  and  RHS)  fields  because  CCEP  could 
not  determine  which  field  or  combination  of  fields  would  identify  a  target  outcome. 
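
To give a rough sense of why exhaustive search is impractical, the following sketch counts only the ways to choose a small set of fields from the 140 attributes reported for the research database in Section II.C. It is an illustration, not part of DaMI: the function name and the field-count limits are assumptions, and the count ignores the further multiplication by field values and by LHS/RHS assignments, so it understates the real search space.

```python
from math import comb


def count_field_subsets(n_attributes: int, max_fields: int) -> int:
    """Count the non-empty attribute subsets containing at most max_fields attributes."""
    return sum(comb(n_attributes, k) for k in range(1, max_fields + 1))


if __name__ == "__main__":
    # 140 is the attribute count given for the research database;
    # the subset-size limits below are arbitrary illustrative choices.
    for max_fields in (2, 4, 6):
        print(max_fields, count_field_subsets(140, max_fields))
```

Even at the field level the count grows combinatorially with the number of fields allowed per hypothesis, which is the motivation for a guided search.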

Finally,  the  problem  of  validation  and  search  space  coverage  must  be  addressed.  A  great 
deal  of  literature  supports  the  idea  that  a  genetic  algorithm  can  deduce  hypotheses  that  apply  to  a 
database. However, it is critical that these results be validated against independent data and
that they be shown to address the research question accurately, rather than merely describing the
particular data set analyzed. Several tools were developed to validate the results, among them an
independent validation algorithm, which re-tests result hypotheses against the
subject database, and a cross-validation procedure, which tests hypotheses generated from one
randomly sampled subset of the database against another randomly sampled subset.
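
The cross-validation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the DaMI validation code: records are assumed to be dictionaries of attribute values, a hypothesis is assumed to be a premise/outcome pair of attribute-value conditions, and simple rule confidence stands in for the fitness measure actually used in the thesis. All function names are hypothetical.

```python
import random


def split_records(records, seed=0):
    """Shuffle a copy of the records and split it into two disjoint halves."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def matches(record, conditions):
    """True if the record has the required value for every attribute in conditions."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def rule_confidence(records, premise, outcome):
    """Fraction of premise-matching records that also match the outcome.

    Returns None when no record matches the premise.
    """
    premise_hits = [r for r in records if matches(r, premise)]
    if not premise_hits:
        return None
    outcome_hits = sum(1 for r in premise_hits if matches(r, outcome))
    return outcome_hits / len(premise_hits)


def cross_validate(records, premise, outcome, seed=0):
    """Score one hypothesis on each of two randomly sampled, disjoint subsets."""
    first, second = split_records(records, seed)
    return (rule_confidence(first, premise, outcome),
            rule_confidence(second, premise, outcome))
```

A hypothesis that scores well on the subset it was generated from but poorly on the other subset is more likely to reflect chance structure in the sample than a real association.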

The  thesis  is  divided  into  seven  chapters: 

•  Chapter I: Introduction

•  Chapter II: Description of the CCEP research, the database itself, and problem
structure challenges

•  Chapter III: Overall solution concept and high-level research approach

•  Chapter IV: Description of the DaMI algorithm, its design, implementation, and
validation processes

•  Chapter V: Technical description of the DaMI algorithm operators, innovations, and
procedures

•  Chapter VI: Summary of results

•  Chapter VII: Conclusion and recommendations for future research
II.  COMPREHENSIVE  CLINICAL  EVALUATION  PROGRAM


A.       BACKGROUND  AND  HISTORY  OF  CCEP 

The  Department  of  Defense  (DoD)  began  to  examine  the  health  consequences  of  Persian 
Gulf  War  (PGW)  service  while  U.S.  troops  were  still  deployed  to  the  Persian  Gulf  Region.  The 
initial  focus  of  medical  researchers  was  on  the  health  risks  associated  with  smoke  from  Kuwaiti 
oil  fires.  As  early  as  1992,  groups  of  PGW  veterans  began  presenting  with  health  complaints 
which  they  attributed  to  PGW  service.  Many  of  these  veterans  reported  nonspecific  symptoms  or 
those  not  directly  attributable  to  a  specific  disease  or  syndrome  (group  of  commonly  occurring 
symptoms/conditions).  This  sparked  the  first  of  many  tests  (first  by  the  Army  in  1992  and 
subsequently  by  other  services)  to  attempt  to  discover  if  these  non-specific  symptoms  could  be 
linked  with  any  "clusters"  of  PGW  veterans.  The  theory  of  this  approach  is  that  a  new  syndrome 
will present as a "cluster" or group of individuals sharing some common trait (demographics,
location,  action,  exposures,  etc.)  who  also  share  a  similar  group  of  symptoms.  (CCEP,  1996,  pp. 
6-7)    This  is  the  first  step  to  identifying  a  new  syndrome.  Once  a  syndrome  is  defined,  then 
medical researchers begin efforts to find the cause of the syndrome. If a solid cause-effect
relationship is established and documented between the syndrome and an entity (virus, bacteria, etc.) or health risk
factor(s) (like smoking or cholesterol), then the syndrome may be considered a full-fledged
disease. 

In  response  to  the  health  concerns  of  PGW  Veterans,  both  DoD  and  Veterans  Affairs 
(VA)  established  similar  comprehensive  clinical  evaluation  programs.  The  data  for  this  research 
comes  from  the  DoD  CCEP.  The  CCEP  program  was  officially  enfranchised  by  the  Assistant 
Secretary of Defense (Health Affairs) as part of a three-point plan, announced on 11 May 1994.
This  plan  included: 

•     The  development  of  an  aggressive,  comprehensive,  clinical  diagnostic  program  to 
offer  intensive  examinations  to  veterans  who  do  not  have  clearly  defined  diagnoses, 


•  An initial independent review of DoD clinical and research efforts concerning the
Persian  Gulf  War  by  Dr.  Harrison  C.  Spencer,  Dean  of  the  Tulane  School  of  Public 
Health  and  Tropical  Medicine,  New  Orleans,  Louisiana,  and 

•  The  creation  of  a  forum  of  national  medical  and  public  health  experts  to  review, 
comment,  and  advise  DoD  concerning  the  results  of  the  clinical  evaluation  program. 
(Joseph,  1994) 

CCEP continues to offer in-depth medical examinations, through the Military Health Services
System (MHSS), to any PGW veteran having health concerns. Over 27,000 PGW veterans and
their dependents have initiated medical examinations with CCEP, of which over 19,000 have
been completed by the participants. The data collected from these 19,000 participants has been
recorded in a single database (the CCEP database), which is the source database for this research.
(CCEP, 1996, pp. 7-12)

Since the inception of CCEP, numerous medical research programs have been conducted
by DoD and non-DoD health organizations (including the Defense Science Board, National
Institutes of Health, Naval Health Research Center in San Diego, University of California,
Department of Health and Human Services, and National Academy of Sciences). Although
several research efforts are still ongoing, the possibility of an unknown syndrome or disease
affecting PGW veterans and their families has been exhaustively examined. DoD has committed
to continue research on this issue but stated:

To  date,  there  is  no  clinical  evidence  for  a  previously  unknown,  serious  illness 
or 'syndrome' among Persian Gulf veterans participating in the CCEP. A
unique  illness  or  syndrome  among  Persian  Gulf  veterans  evaluated  through  the 
CCEP,  capable  of  causing  serious  impairment  in  a  high  proportion  of  veterans 
at  risk,  would  probably  be  detectable  in  the  population  of  18,598  patients. 
However,  an  unknown  illness  or  a  syndrome  that  was  mild  or  affected  only  a 
small  proportion  of  veterans  at  risk  might  not  be  detectable  in  a  case  series,  no 
matter  how  large.  (CCEP,  1996,  p.  4) 

It  is  this  viewpoint  that  has  catalyzed  the  need  for  an  intelligent,  automated  search  program  to 
analyze  the  CCEP  database.  Clearly,  conventional  research  (user-controlled  query  and  clinical 
evaluation)  has  reached  the  limit  of  available  resources,  and  yet  there  is  still  a  possibility  that  a 
syndrome  has  remained  undetected.  Proper  implementation  of  a  genetic  algorithm  can  expand 
the horizon of research by sifting through hypotheses not yet considered, and can do so using
small amounts of time, funds, and human effort.


B.       CCEP  RESEARCH  VISION 

The  core  of  CCEP  research  is  based  on  classic  epidemiological  technique.  The  CCEP 
database  has  been  constructed  to  capture  as  wide  a  range  of  data  about  PGW  participants  as  is 
practical. Data collection practices have been standardized and unbiased: any participant with a
concern  undergoes  the  same  health  screening  and  examination  process.  The  basic  premise  of 
analysis  is  that  a  new  syndrome  will  present  as  "prominent  and  consistent  physical  and 
laboratory findings," as with Legionnaires' disease or toxic shock syndrome, or as consistent "non-specific
symptomatology," as with chronic fatigue syndrome and fibromyalgia.

In  any  case,  CCEP  research  efforts  focus  on  slicing  the  database  in  many  different 
directions,  whether  by  demographic  information,  symptoms,  diagnoses,  or  reported  exposure 
categories.  Percentages  of  PGW  participants  in  each  slice  or  "cluster"  (which  is  a  group  of 
participants with the same characteristics within a given research slice) are compared to the
percentage expected within a similar population not participating in the PGW. In many cases
(especially  when  the  database  is  sliced  by  reported  exposures),  no  comparable  group  is  available, 
so  these  percentages  are  compared  against  actual  percentages  or  distributions  among  all  697,000 
PGW  personnel  (as  opposed  to  just  those  participating  in  CCEP).    The  point  of  the  analysis  is  to 
isolate  any  characteristic  which  appears  to  make  a  CCEP  participant  more  likely  to  have 
approached  CCEP  with  a  medical  condition. 
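
The slice comparison described above amounts to comparing a proportion observed within CCEP against a reference proportion. A minimal sketch, assuming records are dictionaries of attribute values and that the reference share is supplied by the analyst; the function names are hypothetical and this is not the CCEP analysis code:

```python
def cluster_share(records, attribute, value):
    """Fraction of records whose attribute equals the given value."""
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for r in records if r.get(attribute) == value)
    return hits / len(records)


def relative_share(observed_share, reference_share):
    """Ratio of observed to reference share; above 1 means over-represented."""
    if reference_share <= 0:
        raise ValueError("reference_share must be positive")
    return observed_share / reference_share
```

A ratio well above 1 flags a characteristic that is over-represented among participants relative to the reference group; whether the difference is statistically meaningful requires a separate test.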

If  some  specific  combination  of  demographics,  personal  habits  (smoking/non-smoking), 
and reported exposure is associated with specific symptoms and diagnoses within the group of
CCEP  participants,  then  medical  research  is  developed  to  clinically  test  the  relationship  of  these 
factors  to  personal  health.  It  should  be  apparent  that  this  approach  is  extremely  resource 
intensive.  Analysis  dimensions  are  limited  to  the  imagination  of  individual  researchers 
developing  the  slices  and  the  physical  ability  of  medical  researchers  to  examine  the  hypothesis. 
If  the  quality  of  "statistical  interest"  could  be  mathematically  modeled  by  an  automated  research 
tool,  then  the  dimensions  of  analysis  could  be  expanded  to  the  limits  of  computer  (as  opposed  to 
human) resources. The genetic algorithm (DaMI) is a research tool designed specifically to
relieve  humans  from  the  drudgery  of  human-controlled  analysis  so  that  they  may  focus  efforts  on 
clinical  testing  which  machines  cannot  do. 


C.       DATABASE  DESCRIPTION 

The  CCEP  database  is  a  "flat  file"  or  single  table  with  177  attributes.  It  was  created  in 
standard dBase® format and was received and manipulated using the Visual FoxPro®
Database Management System (DBMS). The database was not designed with automated
analysis or, for that matter, medical research in mind. Therefore, a great deal of manual file
manipulation  was  required  before  automated  analysis  was  possible.  By  "manual"  we  mean  the 
issuance  of  single  SQL®  commands  to  reformat  individual  database  schema  and  field  values.  At 
no  time  was  the  actual  data  adjusted,  but  in  many  cases  the  representation  schema  was  changed 
to enhance automated processing. Appendix A contains the CCEP data dictionary along with a
commentary  on  modifications/usability  of  each  field,  and  a  synopsis  of  the  CCEP  data  collection 
process.  The  actual  database  used  for  research  contains  17,033  records  for  active  duty  CCEP 
participants.  Dependent  records  were  removed  prior  to  analysis  at  the  request  of  the  CCEP 
program  manager. 

A  large  number  of  attributes  containing  administrative  and/or  privacy  act  data  were 
removed  from  the  database  and  other  attributes  were  added  to  enhance  the  schema,  as  discussed 
above. (For a more complete description of schema modifications, see section E.D.2.) In all, 140
attributes were present in the research database. Not all were examined at once (see Section
VI.A), but in any case the database was relatively large by medical or occupational health
research  standards.  The  remaining  attributes  fall  into  four  major  categories: 

•  Demographic.  Physical  attributes  of  each  participant  (e.g.  race,  gender,  age,  home 
state,  service  component,  Unit  Identification  Code  [UIC]) 

•  Reported  Exposures.  Reported  exposures  to  potentially  hazardous  environmental 
conditions  by  participants  (e.g.  botulism  vaccine,  oil  smoke,  uranium,  passive 
smoke,  local  water,  SCUD  attack) 

•  Reported Standard Symptoms. Standard symptoms elicited by physicians during
CCEP medical examinations (e.g. difficulty breathing, fatigue, headaches)

•  Diagnoses.  Each  participant  completing  the  entire  CCEP  medical  examination 
process  was  assigned  a  primary  and  up  to  six  secondary  diagnoses.  Diagnoses 
followed  the  standard  numeric  ICD  coding  system  (e.g.  V65.5  -  Healthy  Exam, 
307.81  -  Chronic  Muscle  Tension  Headaches,  780.71  -  Fatigue) 

As  will  be  seen  in  later  sections,  most  analysis  was  conducted  on  associations  between  these 
major  attribute  categories. 


D.       WHY  DOES  A  GENETIC  ALGORITHM  WORK  FOR  CCEP 
ANALYSIS? 


1.         Theory 

The theory of genetic algorithms was developed by John Holland in the early 1970s.
Holland's  purpose  was  to  create  a  search  method  based  on  the  process  of  natural  selection 
observed  in  nature.  He  likened  the  attributes  making  up  a  hypothesis  in  a  search  problem  to 
chromosomes  which  "encode"  a  living  being.  He  proposed  that  by  creating  mathematical 
representations  of  genetic  reproduction  and  applying  natural  selection,  scored  by  a  fitness 
function,  to  those  representations,  he  could  create  an  adaptive  search  engine.  Automation  of  this 
process  has  proven  to  be  an  excellent  task  for  computer  systems.  Although  a  great  deal  of 
evolution is not understood, several general features are agreed upon (Davis, 1991, pp. 2-3):

•  Evolution  is  a  process  that  operates  on  chromosomes  rather  than  on  the  living  beings 
they  encode. 

•  Natural  selection  is  the  link  between  chromosomes  and  the  performance  of  their 
decoded  structures.  Processes  of  natural  selection  cause  those  chromosomes  that 
encode  successful  structures  to  reproduce  more  often  than  those  that  do  not. 

•  The process of reproduction is the point at which evolution takes place. Mutations
may cause the chromosomes of biological children to be different from those of their
biological parents, and recombination processes may create quite different chromosomes
in the children by combining material from the chromosomes of two parents.

•  Biological evolution has no memory. Whatever it knows about producing
individuals that will function well in their environment is contained in the gene pool
(the set of chromosomes carried by the current individuals) and in the structure of the
chromosome decoders.

If  one  is  to  follow  the  theory  of  natural  selection,  then  it  could  be  inferred  that  attributes  used  to 
make  hypotheses  are  the  operators  of  evolution.  The  process  of  hypothesis  evolution  revolves 
around  the  combination  of  those  constituent  attributes  of  successful  hypotheses  and  their 
resulting  recombinations.  Furthermore,  these  recombinations  are  directed  blindly  and  guided 
only  by  the  principle  that  attributes  belonging  to  hypotheses  of  higher  fitness  measure  are 
recombined  more  frequently  than  attributes  belonging  to  hypotheses  possessing  lower  fitness 
measure. 

Holland  went  on  to  create  three  genetic  operators  which  could  mathematically  recombine 
the  modeling  chromosomes  of  coded  hypotheses  to  mimic  genetic  recombination.  Hypotheses 
from the gene pool of the current generation are "selected" with a bias towards hypotheses with higher
fitness  measures,  and  then  operated  on  by  one  of  these  three  genetic  operators: 

• Reproduction. Asexual reproduction of a single parent rule to a single offspring rule
without modification.

• Crossover. Sexual reproduction involving the exchange of chromosomes between
two parents, producing two different child rules.

• Mutation. Asexual reproduction of a single parent rule with random modifications,
resulting in a different child rule.
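
The three operators and the fitness-biased selection step can be sketched in a few lines of Python. This is only an illustration, not the DaMI implementation: the encoding of a hypothesis as a tuple of 0/1 attribute flags, the function names, and the choice of single-point crossover and single-bit mutation are all assumptions made for the example.

```python
import random

# Assumed encoding (not from the source): a hypothesis is a tuple of 0/1 flags,
# one flag per attribute, with 1 meaning the attribute is part of the rule.

def select(population, fitness, rng):
    """Pick one hypothesis with probability proportional to its fitness.

    Assumes fitness values are non-negative and not all zero.
    """
    weights = [fitness(h) for h in population]
    return rng.choices(population, weights=weights, k=1)[0]

def reproduce(parent):
    """Reproduction: copy a single parent unchanged."""
    return tuple(parent)

def crossover(a, b, rng):
    """Crossover: swap the tails of two equal-length parents at a random cut point.

    Requires parents of length 2 or more.
    """
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(parent, rng):
    """Mutation: flip exactly one randomly chosen flag, so the child differs."""
    i = rng.randrange(len(parent))
    return parent[:i] + (1 - parent[i],) + parent[i + 1:]
```

Flipping exactly one flag is a deliberate simplification that guarantees a different child rule; many implementations instead flip each flag independently with a small probability.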

Using the "two-armed and k-armed bandit problems" (see Holland, 1975 for the complete proof),
Holland went on to prove that, lacking prior knowledge of the expected value of two or multiple
choices, allocating slightly more than exponentially increasing trials to choices with the highest
past success is the optimal means for choosing between options. The results of this theory and
their relation to genetic operators are summed up well by Goldberg:

In  other  words,  to  allocate  trials  optimally  (in  a  sense  of  minimal  expected  loss), 
we  should  give  slightly  more  than  exponentially  increasing  trials  to  the  observed 
best  arm... Another  method  that  comes  even  closer  to  the  ideal  trial  allocation  is 
the  three-operator  genetic  algorithm  discussed  earlier.   The  schema  theorem 
guarantees  giving  at  least  an  exponentially  increasing  number  of  trials  to  the 
observed best building blocks. In this way the genetic algorithm is a realizable yet
near optimal procedure (Holland, 1973a, 1975) for searching among alternative
solutions. (Goldberg, 1989)

It is important to reiterate that genetic algorithms gain their speed, not by analyzing an entire
search space, but from deciding which attributes (chromosomes) hold the least probability of
producing interesting hypotheses and not testing hypotheses using those attributes. The process
is  not  fixed,  for  it  relies  on  probability  for  modeling,  and  different  results  will  be  derived  each 
time  the  algorithm  is  run.  This  fact  will  be  discussed  further  in  the  discussion  of  results 
validation. 

Now  let's  bring  this  theory  closer  to  the  current  research  question.  A  hypothesis 
concerning  the  CCEP  database  may  be  "encoded"  into  a  string  representing  its  constituent 
attributes. If one is to hold with Holland's theory, then the attributes (in this case demographic,
exposure, symptom, or diagnosis) which make up the hypothesis (in a group of hypotheses)
having the highest fitness measure should be recombined in an exponentially increasing number
of fashions. Similarly, the attributes from unsuccessful hypotheses should be recombined
exponentially less often. Genetic operators, as used in the DaMI genetic algorithm, prove to be a
near-optimal way of accomplishing this selection. Finally, if this process is followed, then the
extremely  large  search  space  of  correlations  within  the  CCEP  database  will  be  searched  most 
efficiently  using  a  genetic  algorithm.  It  is  on  this  theoretical  basis  that  we  chose  a  genetic 
algorithm  to  analyze  the  CCEP  database. 

2.         Advantages and Disadvantages of the Genetic Algorithm Method

There  is  a  great  deal  of  theoretical  literature  on  the  advantages  and  disadvantages  of 
using  genetic  algorithms.  It  is  the  intent  of  this  section  to  relate  practical  lessons  learned  from 
our  specific  research  using  DaMI  on  the  CCEP  database.  From  the  point  of  view  of  this  research, 
a  genetic  algorithm  was  particularly  useful  because  of  its  ability  to  process  tremendous  amounts 
of data and its lack of need for human interaction. It has already been shown that the CCEP problem
search space is too large to analyze by conventional means, even with a computer. The problem
cannot be structured strongly enough to limit the possibilities to realistic numbers, so technology
is being relied upon to perform the discrimination. Medical research assets are a scarce resource,
so  employing  medical  experts  only  at  the  fitness  function  creation  and  final  analysis  stages 
produces  efficient  and  effective  results.  Should  preliminary  implementation  of  genetic 
algorithms  prove  informative  in  this  area  of  medical  research,  many  other  similar  research 
questions  may  benefit  from  this  technology. 

There are several disadvantages to using genetic algorithms, several of which have
already been alluded to. First, as can be seen from section ELD, a great deal of effort must be
committed  to  database  structure  and  normalization  before  processing.  Since  the  system  relies  on 
computer  evaluation  of  data,  the  data  structure  and  coding  scheme  must  be  uniform  and 
conducive  to  information  extraction.  Non-descriptive  representations  and  textual  data  collection 
will  severely  curtail  system  performance.  The  strong  coding  and  standardization  of  the  CCEP 
database  was  one  of  the  aspects  that  made  it  so  attractive  for  this  type  of  research.  Second,  a 
genetic  algorithm  is  useless  without  a  single,  unambiguous  representation  of  what  is  interesting 
to the operator. This was a key challenge to this research. There are many measures which may
indicate the "interestingness" of a particular hypothesis, but the synthesis of a single aggregate
measure which satisfies all components of epidemiological interest has been extremely difficult
(several different fitness functions may be required). Finally, a difficult paradox arises when
attempting  to  prove  that  a  genetic  algorithm  has  completely  searched  a  large  space.  A  genetic 
algorithm  achieves  its  speed  advantage  by  selective  analysis,  meaning  it  selectively  eliminates 
search  options  with,  apparently,  little  chance  of  yielding  interesting  results.  The  only  way  to 
actually prove that an interesting hypothesis was not missed is to physically test every
hypothesis,  but  we  turned  to  the  genetic  algorithm  because  the  resources  necessary  to  search  the 
entire  space  were  not  available.  To  address  this  problem,  the  genetic  algorithm  is  run  several 
times.  If  the  outcomes  produced  by  several  independent  runs  have  a  high  intersection 
(particularly  among  hypotheses  of  high  fitness),  then  there  is  strong  evidence  that  the  space  has 
been  searched  adequately.  A  more  detailed  discussion  of  this  challenge  is  included  in  Chapter  V. 
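
One simple way to quantify the "intersection" between independent runs is the share of all discovered hypotheses that appear in every run. The sketch below is an assumption about how such a check might be written; the function name and the use of hashable hypothesis identifiers are hypothetical, and the text above does not specify the measure actually used.

```python
def run_overlap(runs):
    """Return the fraction of all hypotheses found that appear in every run.

    `runs` is a list of collections of hashable hypothesis identifiers,
    for example the high-fitness hypotheses kept from each run.
    """
    if not runs:
        raise ValueError("at least one run is required")
    sets = [set(r) for r in runs]
    union = set().union(*sets)
    if not union:
        return 1.0  # nothing found in any run; the runs trivially agree
    common = set.intersection(*sets)
    return len(common) / len(union)
```

A value near 1.0 means the runs keep rediscovering the same hypotheses; a value near 0.0 suggests each run is exploring a different part of the space.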

To  sum  up,  this  research  has  found  that  genetic  algorithms  do  search  a  very  large  space 
of  alternatives  very  quickly  and  efficiently.  Successive  generations  of  hypotheses  quickly 
improve  in  quality  as  measured  by  the  fitness  function,  and  therefore  the  algorithm  does  adjust  its 
search  to  the  operator's  goals.  Strong  database  standardization  and  coding  are  a  must  before  any 
processing is attempted. A genetic algorithm has proven successful for this research, as long as a
fitness function can be created which accurately defines "what is interesting" to the researchers.


E.       KEY  CHALLENGES  TO  CCEP  ANALYSIS  BY  A  GENETIC 
ALGORITHM 


1.         Problem  Structure 

The single most challenging aspect of this research is that "Persian Gulf Syndrome," as it
is referred to by the media, PGW veterans, and some researchers, is not yet really a defined
syndrome at all. A syndrome must be defined by a unique series of symptoms and/or ailments
which are shared by a specific group of individuals. Although many PGW veterans report a wide
array of non-specific medical ailments associated with PGW service, no defined set of
symptomatology has been instantiated as a candidate syndrome.

CCEP  clinicians  have  identified  a  wide  range  of  specific  diagnoses  (i.e. 
migraine  headache,  depression,  asthma,  arthritis,  hypertension).  However,  few 
if  any  of  the  conditions  diagnosed  to  date  could  be  considered  specific  for  any  of 
the  many  different  exposures  implicated  as  potential  causes  of  Persian  Gulf 
illnesses.   Thus  as  a  case  series,  the  CCEP  has  identified  a  wide  spectrum  of 
different  clinical  conditions  rather  than  any  singular  homogeneous  diagnostic 
entity  (CCEP,  1996,  p.  79) 

While the medical implications of this statement are serious, the impact of this situation on
research  is  tremendous.  Basically,  CCEP  medical  researchers  cannot  provide  us  with  a 
description  of  a  target  syndrome  for  research,  or  for  that  matter  if  there  are  one,  many,  or  any 
syndrome(s)  at  all.  Without  target  syndrome  characteristics,  a  researcher  is  unable  to  identify 
which  field  or  combinations  of  fields  within  the  database  indicate  a  desired  outcome  (a  syndrome 
of  interest).  In  truth,  researchers  do  not  know  if  the  data  necessary  to  identify  a  syndrome, 
should  one  exist,  is  contained  in  the  database  at  all.  Therefore,  we  have  been  compelled  to 
develop  a  tool  which  can  examine  "interesting"  associations  between  any  number  of  causative 
and  outcome  attributes  without  specificity  as  to  the  limits  of  either  the  causative  or  outcome 
space.  This  is  both  a  curse  and  a  blessing;  the  lack  of  specifics  makes  the  problem  considerably 
more  challenging  but  also  stimulates  interest  in  our  type  of  tool. 

What  can  be  reasonably  asked  about  the  problem  is  the  following: 

• Is there a syndrome? Is there a subset a (of A) of ailments such that the occurrence rate
of a in PGW participants (G) is higher than the rate in a reference population (R)?
[#a(G) equates to "number of occurrences of an ailment within the set of participants
(G)"]

\[ \frac{\#a(G)}{\#(G)} > \frac{\#a(R)}{\#(R)} \]

• What caused the syndrome? Is there a subset x (of X) of exposures and/or
demographics experienced by or attributed to participants in the PGW such that, for
ailments a for which the prior inequality is true, exposures/demographics x account
for a significant part of the difference in occurrence rates of a in groups G and R?

\[ P(a \mid x, G) = \frac{\#a(G)}{\#x(G)} \approx \frac{\#a(R)}{\#x(R)} = P(a \mid x, R) \]
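
The first question above amounts to comparing two occurrence rates. A minimal sketch follows, assuming each participant is represented as a set of ailment labels; the representation and the function names are illustrative assumptions, not the CCEP schema, and a reference population R would have to be obtained separately.

```python
def occurrence_rate(records, ailments):
    """Fraction of records in which every ailment in `ailments` is present.

    `records` is a list of sets of ailment labels, one set per person.
    """
    if not records:
        raise ValueError("population must not be empty")
    hits = sum(1 for r in records if ailments <= r)
    return hits / len(records)

def is_elevated(gulf, reference, ailments):
    """True when the rate of `ailments` in G exceeds the rate in R."""
    return occurrence_rate(gulf, ailments) > occurrence_rate(reference, ailments)
```

A raw comparison like this says nothing about statistical significance; in practice the difference would need to be tested against sampling variation.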

The lack of precise target syndrome definition encourages the development of multiple
research  strategies.  As  mentioned  before,  the  directed  query  technique  used  by  CCEP  (CCEP, 
1996,  pp.  17  -  49)  has  sliced  the  database  from  numerous  different  perspectives.  What  is  needed 
is  a  search  tool  which  can  examine  multiple  combinations  of  independent  (LHS)  and  dependent 
(RHS)  variables  and  all  possible  values  for  each  variable  simultaneously.  This  adds  an  extra 
dimension  to  the  analysis.  Conventional  data  mining  tools  typically  allow  the  user  to  specify  a 
range  of  possible  LHS  variables  for  search  and  a  single  RHS  variable.  Multiple  RHS  fields  may 
still  be  handled  under  this  doctrine  by  creating  a  pseudo  field  which  contains  a  different  value  for 
each  unique  combination  of  values  in  the  RHS  fields  to  be  examined.  However,  if  the  RHS 
fields  for  analysis  are  large  in  number  or  cannot  be  specifically  identified,  the  pseudo  field 
coding becomes impractically large. What is needed instead is a data mining tool which can
apply  selective  induction  operators  to  a  range  of  possible  attributes  (not  just  individual  attribute 
and  value  instances)  on  the  LHS  and  RHS  simultaneously. 
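
The growth of the pseudo field can be made concrete with a short sketch. It assumes Boolean RHS fields and packs each combination into one integer code; the function names are hypothetical and are not taken from any particular data mining tool.

```python
def pseudo_field(values):
    """Encode one combination of Boolean RHS values as a single integer code."""
    code = 0
    for flag in values:
        code = (code << 1) | int(bool(flag))
    return code

def pseudo_field_size(n_fields):
    """Number of distinct codes needed for n Boolean RHS fields."""
    return 2 ** n_fields
```

Under these assumptions, 21 Boolean RHS fields would require a pseudo field with 2,097,152 distinct values, which illustrates why the coding becomes unmanageable as the RHS grows.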

This  methodology  is  plausible  and  in  fact  was  done  by  DaMI  in  this  research,  but  it  is 
prudent  to  note  that  this  strategy  will  still  produce  an  extremely  large  search  space.  For  example, 
the  first  analysis  done  by  DaMI  examines  the  associations  between  15  standard  symptoms  (LHS) 
and  21  possible  diagnoses  (RHS).  All  attributes  are  Boolean  and  are  not  limited  in  the  number  of 
simultaneous combinations (all symptoms and diagnoses could be simultaneously present or
"true"). Therefore the possible search space is $2^{36}$, or about $6.8 \times 10^{10}$, possible
hypotheses. It is for this
specific reason that we chose to use a genetic algorithm, with its ability to discriminately analyze
tremendous search spaces. A test was conducted in which this particular problem was analyzed
using simple "brute force" (test every possible combination indiscriminately), using a 486DX/66
MHz personal computer. The personal computer was able to test about 600,000 combinations per
day. At this rate, this one complete analysis would take 114,992 days (315 years). Even if a
platform  were  chosen  that  was  100  times  faster  than  our  test  personal  computer,  the  analysis 
duration  would  be  an  unacceptable  3.15  years. 
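
The arithmetic behind this estimate can be reproduced directly. The constant names below are illustrative, and the daily rate is the approximate figure quoted above, so the result lands close to, but not exactly on, the day count given in the text.

```python
SYMPTOMS = 15            # Boolean LHS attributes
DIAGNOSES = 21           # Boolean RHS attributes
RATE_PER_DAY = 600_000   # approximate brute-force rate quoted for the test PC

# Every attribute is either present or absent, so the space doubles per attribute.
search_space = 2 ** (SYMPTOMS + DIAGNOSES)

days = search_space / RATE_PER_DAY
years = days / 365
years_on_faster_platform = years / 100   # platform assumed 100 times faster

print(f"{search_space:,} hypotheses")
print(f"about {days:,.0f} days, or {years:.0f} years")
print(f"about {years_on_faster_platform:.2f} years at 100x speed")
```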

2.         Database Content and Structure

Several  problems  were  encountered  during  the  course  of  this  research  with  the  CCEP 
database  content  and  structure.  These  problems  fall  into  two  major  categories:  data 
representation  anomalies  which  make  it  difficult  for  an  algorithm  to  extract  meaningful 
information  from  the  data,  and  data  collection  anomalies  which  introduce  bias  into  the  data  being 
analyzed.  Examples  of  data  representation  anomalies  include  irrelevant  data  and  non-normalized 
data.  These  problems  must  be  corrected  before  useful  analysis  can  be  conducted;  they  usually 
require  modification  of  the  database  itself.  In  the  case  of  CCEP,  data  collection  anomalies 
include  data  that  were  self-reported  by  participants,  self-referral  of  PGW  veterans  to  the  CCEP 
program,  and  lack  of  an  established  control  group.  Collection  anomalies  do  not  interfere  with 
analysis  itself,  but  they  must  be  acknowledged  or  accounted  for  when  examining  results. 

Seventy-seven  fields  in  the  CCEP  database  are  simply  unusable.  Many  fields  contain 
sensitive  unclassified  data  on  the  participants  (names,  social  security  numbers,  addresses,  etc.) 
which  is  not  helpful  for  medical  research  and  is  subject  to  the  Privacy  Act  of  1974.  Those  fields 
were deleted at the outset. Another, larger group of fields is used by CCEP for administrative
processing and is similarly not helpful to research. Finally, there were some fields that have
been collected as non-standardized text. The most serious occurrence of this is the "chief
complaint," or in other words the reason that the participant approached CCEP for an
examination. No standardization was enforced in this free-text field, so it is nearly impossible
for a computer to determine similarity between tuples, short of creating a complete index of chief
complaint texts and some standard category indicator. This is fortunately not the case with
diagnoses,  which  use  the  standard  numeric  ICD  coding  system.  Participant  complaint 
information  was  captured  in  the  form  of  fifteen  standard  symptoms,  but  a  coded  chief  complaint 
would  prove  most  helpful. 

A  key  shortcoming  of  the  database,  reported  at  the  outset  by  CCEP,  is  the  large  amount 
of data which are self-reported by participants. Self-reported data are those determined directly
by responses from participants during their medical examinations (as opposed to
clinical test results, review of documentation, or impartial third-party observation). Self-reported
data are analogous to a survey, which is in and of itself not a database flaw. However, in the
context of CCEP, all exposure and standard symptom data are self-reported. This reduces the
direct  applicability  of  aggregate  participant  responses  because  perceived  exposure  may  be 
distinctly  different  from  actual  exposure.  This  is  most  easily  demonstrated  by  an  example  we 
call  "the  Botulism  Illusion."  Within  the  CCEP  database,  26.4%  (4,500)  of  the  active-duty 
participants  report  receiving  the  botulism  vaccine.  Now  it  is  known  from  medical  records  that 
only  8,800  or  1.26%  of  the  697,000  PGW  veterans  were  given  this  vaccine.  This  high 
percentage  (26.4%  of  participants)  would  appear  to  suggest  a  possible  relationship  between  the 
botulism  vaccine  and  PGW  medical  ailments,  until  it  is  pointed  out  that  21.9%  of  the  CCEP 
participants  who  were  examined  and  deemed  "healthy"  (primary  diagnosis  of  V65.5)  also 
reported  receiving  the  botulism  vaccine.  (See  Figure  #1)  Problems  concerning  reported  data 
may  be  compensated  for  by  collecting  and  examining  a  "control  group"  of  participants  who  do 
not  have  significant  medical  conditions;  however,  reported  data  should  always  be  interpreted 
with  some  degree  of  caution. 


[Figure panels: Reported by CCEP Participants; Reported by "V65.5" Participants; Actual.
Legend: Received Immunization; No Immunization]

Figure 1. The Botulism Illusion
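
The figures quoted for the Botulism Illusion can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones given in the text.

```python
REPORTED_ALL = 0.264      # share of active-duty CCEP participants reporting the vaccine
REPORTED_HEALTHY = 0.219  # share of "healthy" (V65.5) participants reporting it
VACCINATED = 8_800        # veterans given the vaccine according to medical records
VETERANS = 697_000        # total PGW veterans

actual = VACCINATED / VETERANS
over_report = REPORTED_ALL / actual
healthy_ratio = REPORTED_ALL / REPORTED_HEALTHY

print(f"actual rate: {actual:.2%}")
print(f"reported rate is about {over_report:.0f} times the actual rate")
print(f"all participants vs. healthy participants: {healthy_ratio:.2f}")
```

The reported rate exceeds the documented rate by roughly a factor of twenty, while participants with and without diagnosed conditions report the vaccine at broadly similar rates, which is why the raw 26.4% figure is misleading on its own.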

Another obstacle to a meaningful analysis of the CCEP database is the self-referral of
participants (each made a conscious decision to start the CCEP examination process).
As  described  in  Appendix  A,  any  individual  who  was  eligible  for  medical  care  under  the  MHSS 
system  in  1994  and  had  a  health  concern  related  to  PGW  service  (whether  directly  or  indirectly) 
could  request  a  full  medical  evaluation  under  the  CCEP  program.  This  encouraged  a  wide  range 
of  participants,  but  the  self-referral  of  patients  may  invalidate  the  CCEP  database  as  a  statistical 
representation  of  PGW  veterans  as  a  whole.  Had  the  participants  in  CCEP  been  selected 
randomly,  then  their  aggregate  response  and  demographic  data  could  have  been  considered 
statistically  representative.  In  this  case,  the  sheer  act  of  self-referral  introduces  some  level  of  bias 
which,  if  it  can  be  identified,  should  be  explained  to  the  degree  possible.  One  possible  solution 
is  to  randomly  select  a  suitably  large  group  of  PGW  veterans,  regardless  of  health  concerns,  and 
provide  them  with  the  same  medical  evaluation  as  the  other,  self-referred,  participants.  In  other 
words,  create  a  control  group.  A  control  group  will  help  identify  bias  from  both  self-reporting 
and self-referring. Unfortunately, this has not been adopted as part of the CCEP program.
Suggestions  have  been  made  to  create  a  control  group  after-the-fact,  but  a  strong  argument  can  be 
made  that  the  passage  of  time  since  1994  will  introduce  similar  bias  into  the  responses  of  a 
present-day  control  group. 

The  reader  should  not  infer  that  the  CCEP  database  is  a  poor  source;  it  has  many  strong 
points. After removal of unusable fields and reformatting of other fields for enhanced analysis, 140
"good" fields have remained for analysis. One of the most positive aspects of the database is the
standardization of CCEP data collection. From the outset, CCEP used the same database
structure, examination process, and coding scheme for all medical examinations. There are some
exceptions, such as the case of chief complaint (mentioned above), but overall the data content is
strongly coded and standardized. Any reader who has dealt with data analysis at all should
appreciate the importance of a uniform database structure and coding system to computer
analysis. Something as simple as representing an affirmative response as "Y" or "Yes" or "yes"
can make computer-based query far more difficult. Of particular significance was the uniform
usage  of  numeric  ICD  codes  to  represent  outcome  diagnoses. 

3.         Database Normalization

The  uniform  coding  scheme  used  in  the  CCEP  database  and  limited  need  for  scalar 
(continuous  numerical)  data  sharply  reduced  the  need  for  normalization  (when  used  in  a  data 
mining  context,  "normalization"  means  structuring  a  database  for  effective  computer  analysis). 
The  coding  scheme  used  in  the  CCEP  database  is  quite  strong,  so  only  a  few  modifications  were 
made  to  normalize  the  database.  Three  significant  modifications  were  made  to  the  schema  for 
analysis.  Diagnoses  were  converted  from  single  fields  to  multiple  Boolean  fields  to  facilitate 
analysis  of  diagnosis  combinations.  Standard  symptoms  were  changed  from  durations  to  simple 
occurrence  to  simplify  the  ambiguity  of  comparing  duration  categories.  Finally,  an  aggregate 
reproductive  disorder  field  was  created  to  relate  reported  reproductive  disorders  of  any  type. 

a.         Boolean  representation  of  diagnoses 

The  CCEP  database  captures  outcome  diagnoses  assigned  by  the  examining 
physician  as  a  primary  diagnosis  and  six  secondary  diagnoses.  CCEP  researchers  assign  a 
somewhat  higher  emphasis  to  the  primary  diagnosis,  and  place  little  weight  on  the  ordering  of 
secondary  diagnoses.  Therefore,  a  medical  researcher  would  not  differentiate  between  a 
diagnosis  of  fatigue  appearing  second  or  say  fourth  on  a  list  of  diagnoses  attributed  to  a 
participant.  A  computer  on  the  other  hand  could  consider  these  distinctly  different  occurrences. 
Since combinations are paramount to this research, it is much easier to represent and analyze a
string of diagnosis fields with Boolean (yes or no) values than a string of up to seven
unordered diagnoses. However, 1700 different diagnoses were assigned to the 19,000+ CCEP
participants, so a pure Boolean representation would be extremely unwieldy. We decided to
represent the twenty-one most frequently occurring diagnoses as Boolean fields in addition to
the  existing  ICD  representation.  The  number  twenty-one  was  selected  arbitrarily  (it  can  be 
expanded  in  future  research),  but  at  least  one  of  the  selected  diagnoses  is  included  in  74.7%  of 
participant  outcomes.  See  Figure  #2  below. 
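
The restructuring described above can be sketched as follows. This is a hedged illustration rather than the procedure actually used: it assumes each participant's diagnoses arrive as an unordered list of code strings, and the function names are hypothetical.

```python
from collections import Counter

def top_diagnoses(records, k):
    """Return the k codes assigned to the most participants.

    `records` is a list of diagnosis-code lists, one list per participant.
    Each code is counted once per participant, regardless of its position.
    """
    counts = Counter(code for codes in records for code in set(codes))
    return [code for code, _ in counts.most_common(k)]

def to_boolean(codes, selected):
    """Map one participant's unordered diagnoses to a flag per selected code."""
    present = set(codes)
    return {code: code in present for code in selected}

def coverage(records, selected):
    """Fraction of participants with at least one of the selected diagnoses."""
    chosen = set(selected)
    return sum(1 for codes in records if chosen & set(codes)) / len(records)
```

With `k = 21`, `coverage` corresponds to the kind of figure quoted above (74.7% of participant outcomes), and raising `k` trades a wider Boolean string for higher coverage.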

[Figure panel: Original Diagnosis Representation]

Figure 2. Diagnosis Attribute Restructuring

b.  Standard  Symptoms 

In the CCEP database, participants are asked to report suffering from fifteen
standard symptoms (e.g. chest pain, difficulty breathing, headaches). The responses are
collected as dates of onset and durations. The date and duration are subjective (and subject to error),
and  like  diagnoses,  difficult  for  an  automated  search  engine  to  compare.  A  higher  confidence  can 
be  assigned  to  a  response  if  it  is  represented  as  a  Boolean  (the  participant  will  in  most  cases 
accurately  report  existence  of  the  symptoms,  while  his/her  ability  to  estimate  an  onset  and 
duration  is  questionable).  Therefore,  fifteen  additional  fields  are  added  to  the  CCEP  database, 
one  corresponding  to  each  symptom  and  equal  to  "Y"  if  the  participant  reported  the  symptom  at 
any  time  for  any  non-zero  duration. 
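
As a small illustration of this conversion, the sketch below collapses a reported duration into an occurrence flag. It assumes the duration is available as a number (with `None` meaning the symptom was not reported) and that unreported symptoms are coded "N"; both are assumptions, since the stored durations are described elsewhere as categories and only the "Y" case is stated above.

```python
def symptom_flag(duration):
    """Return "Y" when a symptom was reported with a non-zero duration, else "N".

    `duration` is a number in any time unit, or None if nothing was reported.
    """
    return "Y" if duration is not None and duration > 0 else "N"
```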

c.  Reproductive  Disorders 

One  of  the  high  visibility  aspects  of  the  PGW  is  the  possibility  that  a  syndrome 
may  be  causing  PGW  participants  to  experience  a  higher  rate  of  reproductive  disorders 
(specifically birth defects). The CCEP database captures reproductive disorders (a participant may
report a reproductive disorder actually experienced by a spouse or manifested in offspring) in five
areas:

• Infertility

• Miscarriages

• Still births

• Infant deaths

• Birth defects

These  five  categories  are  further  subdivided  into  disorders  experienced  prior  to  and  after  PGW 
service,  making  a  total  of  10  reproductive  disorder  fields.  We  cannot  be  certain  that  a  syndrome, 
should  it  exist,  would  cause  only  one  form  of  reproductive  disorder.  Therefore,  two  new  fields 
were  created  to  reflect  any  reproductive  disorder  experienced  by  the  participant,  either  prior  to  or 
after  the  PGW  conflict.  In  other  words,  if  a  participant  reported  infertility,  a  miscarriage,  a  still 
birth,  an  infant  death,  or  a  child  with  birth  defects  prior  to  PGW  service,  then  the  new  field 
(PQ_prior)  was  set  to  "Y."  If  none  of  these  were  experienced  prior  to  PGW  service,  then 
PQ_prior  was  set  to  "N."  Similarly,  if  any  of  the  five  sub-categories  were  affirmatively  answered 
after  PGW  service,  then  PQ_after  was  set  to  "Y."  This  will  allow  the  research  to  be  more 
sensitive  to  associations  between  demographic,  exposure,  symptom,  and  diagnosis  data  and  any 
combination  of  reproductive  disorders.  Naturally,  any  interesting  associations  developed 
concerning  these  two  new  fields  will  need  to  be  re-categorized  by  medical  researchers  before  a 
finding  may  be  made. 
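
The aggregation into PQ_prior and PQ_after can be sketched as below. The ten input field names are hypothetical (the actual CCEP field names are not given here), and coding PQ_after as "N" when nothing was reported is assumed by symmetry with PQ_prior.

```python
# Hypothetical category names; each is assumed to exist as "<name>_prior"
# and "<name>_after" fields holding "Y" or "N".
CATEGORIES = ("infertility", "miscarriage", "still_birth", "infant_death", "birth_defect")

def aggregate_flags(record):
    """Collapse ten Y/N reproductive-disorder fields into PQ_prior and PQ_after."""
    result = {}
    for period in ("prior", "after"):
        any_yes = any(record.get(f"{cat}_{period}") == "Y" for cat in CATEGORIES)
        result[f"PQ_{period}"] = "Y" if any_yes else "N"
    return result
```

Missing fields are treated the same as "N" in this sketch, which may or may not match how blank responses were handled in the original normalization.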

After  completion  of  normalization,  6  demographic,  32  reported  exposure,  15  (Boolean) 
standard  symptom,  and  21  (Boolean)  diagnosis  fields  are  available  for  automated  analysis. 
These  74  fields  observe  a  uniform  structure  and  coding  scheme  and  are  the  foci  of  this  research. 
Please  consult  Appendix  A  for  a  detailed  list  of  analyzed  fields. 

4.         What is "Interesting?"

In Section II.D.1, we asked the question, "What is a syndrome?" It is necessary at this
point to revisit this question, but from an automated analysis perspective. A genetic algorithm
depends (as do many other techniques) on the ability of the researcher to define in quantitative
terms what is "interesting." The problem in many forms of decision science is not whether a
model  performs  accurately,  but  rather  if  it  improves  the  quality  of  a  decision.  In  a  genetic 
algorithm,  selection  of  hypotheses  to  evaluate  is  proportionally  related  to  a  "fitness"  value  for 
each  hypothesis,  so  it  is  critical  that  our  "fitness  function"  accurately  represents  the  interest  of 
medical researchers. This characteristic is reflected in the fundamental genetic theory:

"Roughly, the fitness of a phenotype is the number of its offspring which survive
to reproduce... This measure rests upon a universal, and familiar, feature of
biological systems: Every individual (phenotype) exists as a member of a
population of similar individuals, a population constantly in flux because of the
reproduction and death of the individuals comprising it. The fitness of an
individual is clearly related to its influence upon the future development of the
population. When many offspring of a given individual survive to reproduce,
then many members of the resulting population, the "next generation," will
carry the alleles of that individual." (Holland, 1975, p. 12)

This returns us to the fundamental question: "What is interesting to CCEP medical researchers
and how will that interest be manifested in the database?" In Section II.D.1, we stated that we
are  not  sure  whether  a  syndrome  exists,  and,  if  it  does  exist,  we  are  not  certain  that  the  data 
captured  in  the  CCEP  database  are  appropriate  to  identify  it.  However,  if  these  two  uncertainties 
are  removed,  the  following  assertions  can  be  made: 

• If there are one or more syndrome(s) affecting PGW veterans, the data to identify
them may already exist in the CCEP database but be hidden by the sheer volume of
data.

• In this case, a syndrome will manifest itself as a single or unique group of diagnoses
or symptoms shared by a cluster of participants sharing some common exposure
and/or demographic attribute(s).

By plunging directly into a search for associative relationships between risk factors and
outcomes,  we  bypass  a  fundamental  step  in  classical  epidemiological  technique.  Normally, 
epidemiologists  will  first  define  the  outcome  diagnoses  and/or  symptomatology  which  describe  a 
prospective  syndrome.  Once  the  definition  is  made,  then  research  efforts  are  focused  on 
associations  with  risk  factors  and  other  exposure  sources.  Unfortunately,  the  present  research  is 
left  with  a  less  than  optimal  situation.  We  suggest  that  a  promising  use  for  a  genetic  algorithm  is 
to  give  clues  to  medical  researchers  that  help  them  define  a  syndrome. 

In  this  research,  we  have  accepted  that  conventional  research  methods  alone  may  not  be 
able  to  define  and  isolate  a  syndrome  affecting  PGW  veterans.  We  are  now  led  to  re-examine  the 
problem from different perspectives. Our research approach has been guided by the following
ideas: 

•  We  are  not  trying  to  create  an  analysis  that  will  isolate  a  single  pre-defined  Desert 
Storm  Syndrome.  Instead  we  are  defining  a  profile  that  a  syndrome  might  follow, 
should  it  exist.  Our  goal  is  to  determine  how  a  possible  syndrome  would  be 
reflected  in  the  data,  as  discriminately  as  possible,  and  then  construct  a  fitness 
function  which  is  appropriately  high  when  this  profile  is  met. 

• Our genetic algorithm does not find a Desert Storm Syndrome, but rather distills the
billions of possible hypotheses into a set of hundreds. Not all of the candidate
hypotheses are syndromes, but if one or more syndromes do exist, they will be
found in the candidate set. This smaller set of candidate hypotheses may realistically
be examined more exhaustively by medical researchers and other conventional
means.

• By implementing the genetic algorithm as a precursor to medical research (and
setting aside the idea that it must find "the answer"), we allow the genetic algorithm to
significantly reduce the burden on the relatively scarce medical research assets at a
relatively  small  cost  to  the  organization.  In  more  basic  terms,  the  secret  to  operating 
genetic  algorithms  in  an  imperfect  world  is  to  allow  them  to  do  the  first  80%  of  the 
analysis  work  with  only  20%  of  the  research  cost. 

With the question of "interest" now bounded, a proper fitness function may now be
pursued.  If  a  true  syndrome  does  exist,  then  it  is  "caused"  by  something.  Therefore,  the 
participants  will  share  some  finite  set  of  exposure  mediums,  or  in  other  words  all  participants 
with  a  syndrome  will  share  some  commonality  in  exposure.  This  must  be  caveated  by  saying 
that  the  CCEP  database  may  or  may  not  contain  the  demographic  and  exposure  elements  to 
identify  that  commonality  of  exposure.  But  as  our  research  mindset  states,  we  are  only 
attempting  to  establish  the  profile  of  a  syndrome  if  it  exists,  and  if  the  data  necessary  to  identify 
it  is  contained  in  the  CCEP  database.  If  the  prior  statement  is  true,  then  there  will  be  a  relatively 
strong  association  between  a  finite  set  of  exposure/demographic  attributes  and  a  unique 
combination  of  outcome  diagnoses.  Likewise,  there  will  be  a  strong  association  between  a  finite 
set  of  exposure/demographic  attributes  and  a  specific  combination  of  standard  symptoms.  The 
intersection  between  diagnoses  and  symptom  combinations  with  similar  exposure  associations 
will  profile  a  candidate  syndrome.  See  Figure  #3  below. 


[Figure 3 diagram; panel labels: Standard Symptoms, Outcome Diagnoses, Reported Exposures/Demographics]

Analysis run #1 identifies high association between joint pain and hair loss, and botulism vaccine, depleted uranium and
male participants.

Analysis run #2 identifies high association between memory loss and fatigue diagnoses, and botulism vaccine,
depleted uranium and male participants.

Figure 3. Hypothesized Syndrome Profile

Now our question of "what is interesting?" can be defined. "Interesting" hypotheses are combinations
of RHS attributes (dependent variables) which are highly dependent on combinations of LHS
attributes (independent variables); in other words, the candidate dependent variables are truly
determined by (not independent of) the candidate independent variables. The fitness function
used  must  be  such  that  hypotheses  which  demonstrate  this  property  will  be  assigned  a  relatively 
high  fitness  value.  There  are  numerous  accepted  functions  in  statistical  literature  that  fit  this 
requirement.  Several  of  these  are  discussed  in  the  next  section. 

a.         Conventional  Epidemiological  Measures 

A great deal of literature already exists, such as (Goldberg, 1989) and (Holland,
1975),  to  support  the  idea  that  genetic  algorithms  are  quite  successful  at  adaptively  improving  the 
quality  of  tested  rules  to  suit  the  provided  fitness  function.  From  the  outset,  our  genetic 
algorithm  demonstrated  this  quality.  However,  the  greatest  challenge  has  been  to  ensure  that  the 
search  model  adequately  represents  the  research  questions  (i.e.  the  genetic  algorithm  is  doing 
what  it  was  told  to  do,  but  have  we  provided  it  with  relevant,  meaningful  instructions?).  As  a 
starting  point  for  development  of  the  fitness  measure  for  this  research,  we  first  turned  to  classical 
epidemiology  literature. 

Classical  epidemiology  evaluates  any  test  in  terms  of  four  variables  (see  Figure  #4 
below)  which  describe  how  successfully  a  test  predicts  the  actual  presence  (or  lack)  of  a  specified 
disease.  This  is  much  akin  to  our  own  research  which  attempts  to  identify  the  success  of  a  single 
or  multiple  exposure  and/or  risk  factor  attributes  predicting  a  combination  of  symptoms  or 
clinical  diagnoses.  In  epidemiology,  these  four  variables  {a,  b,  c,  d}  are  computed  using  a  two- 
by-two  matrix  of  test  results  and  actual  disease  presence. 

                          Disease
                   Present               Absent

Test   Positive    a  True Positive      b  False Positive      PV(+) = a/(a+b)

       Negative    c  False Negative     d  True Negative       PV(-) = d/(c+d)

                   Sensitivity           Specificity
                   a/(a+c)               d/(b+d)

Figure 4. Classical Epidemiological Measures

By  mathematically  manipulating  these  four  variables,  four  "quality"  values  are  obtained  from  the 
relationship  between  the  subject  test  and  subject  disease.  In  each  case,  keep  in  mind  that  our 
research  is  applying  the  risk/exposure  as  a  test  for  (or  indicator  of)  a  specific  symptom  and/or 
diagnosis  profile.  These  quality  values  are  (Fletcher,  1982,  pp.  43  -  57): 


•  Positive Predictive Value. Indicates the ability of a positive test result to accurately
identify the presence of a disease in a patient. This term is similar to "confidence" used
as a fitness measure in many data mining tools. We term this "forward confidence."

\[ PV(+) = \frac{a}{a + b} \]

•  Negative Predictive Value. Indicates the ability of a negative test result to accurately
determine the absence of a disease in a patient. Most data mining tools do not consider
this measure, but recommend the analysis be run with swapped dependent and
independent variables. This is not practical if multiple dependent variables are being
analyzed.

\[ PV(-) = \frac{d}{c + d} \]

•  Sensitivity. The proportion of subjects with a disease who have a positive test for the
disease. A sensitive test will rarely miss people with the disease.

\[ \mathrm{sensitivity} = \frac{a}{a + c} \]

•  Specificity. The proportion of subjects without the disease who have a negative test. A
specific test will rarely misclassify people without the disease as diseased.

\[ \mathrm{specificity} = \frac{d}{b + d} \]
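
The four quality values above can be computed directly from the two-by-two counts. The following is a minimal sketch, not part of the DaMI system; the names `TwoByTwo` and `quality_values` are hypothetical, and the sample counts are the depleted uranium/fatigue counts that appear later in Figure 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a two-by-two table: a=true positive, b=false positive,
    c=false negative, d=true negative."""
    a: int
    b: int
    c: int
    d: int


def quality_values(t: TwoByTwo) -> dict:
    """Return the four classical quality values as fractions in [0, 1]."""
    return {
        "pv_pos": t.a / (t.a + t.b),       # positive predictive value
        "pv_neg": t.d / (t.c + t.d),       # negative predictive value
        "sensitivity": t.a / (t.a + t.c),
        "specificity": t.d / (t.b + t.d),
    }


if __name__ == "__main__":
    # Counts taken from Figure 6 of the source text.
    for name, value in quality_values(TwoByTwo(11, 84, 146, 7505)).items():
        print(f"{name}: {value:.1%}")
```

The sketch assumes every row and column total is non-zero; an empty row or column would raise a division error.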


b.         Fitness  Measure  Paradoxes 

In  our  research,  classical  epidemiology  measures  are  helpful  in  choosing  a 
suitable  fitness  function,  but  no  single  aforementioned  measure  is  sufficient  for  several  reasons. 
Rather  we  desire  an  aggregate  fitness  measure  which  will  increase  in  response  to  any  classic 
measure  of  interest.  Fundamentally,  this  research  problem  differs  from  clinical  test  evaluation  in 
one  respect.  While  a  high  number  of  either  false  positive  (b)  or  false  negative  (c)  tests  is  a 
counter-indication  of  a  test's  quality,  it  is  also  desirable  (in  our  case)  if  a  risk/exposure 
combination  is  contraindicative  of  an  outcome  symptom/diagnosis  set.  In  certain  cases,  a  true 
positive  may  mean  nothing  because  there  are  also  many  false  positives.  In  other  cases,  a 
simultaneously  high  false  positive  and  false  negative  is  quite  informative.  This  is  best  described 
by  an  example  (Figure  #5),  but  basically,  in  the  case  of  CCEP  database  analysis,  we  are  most 
interested  in  the  hypotheses  having  highest  values  and  lowest  values  of  sensitivity  and 
specificity. 

->  Consider the most simple hypothesis, 1 LHS (L) and 1 RHS (R) field.

•  If L and R are Boolean, there are four possible hypotheses to test.

•  We are looking for more than just a high prob(R="yes"|L="yes").

INTERESTING     Hypothesis                        NOT INTERESTING

90%             IF L = "yes" THEN R = "yes"       10%

10%             IF L = "yes" THEN R = "no"        90%

80%             IF L = "no" THEN R = "no"         80%

20%             IF L = "no" THEN R = "yes"        20%

->  As the number of fields and/or values per field increases, the
problem expands exponentially.

Figure 5. Attribute Value Relationships

c.  Alternative  Fitness  Measures 

Now  that  our  concept  of  "interesting"  has  been  framed  from  the  epidemiological 
perspective,  we  can  set  about  the  task  of  selecting  a  single  fitness  measure  which  mathematically 
describes  our  concept  of  interest  to  the  genetic  algorithm.  Again,  there  is  some  challenge  in  this 
because  there  are  several  different  measures  of  interest  to  medical  researchers  (discussed  in  the 
previous  section),  yet  the  genetic  algorithm  requires  a  single  aggregate  fitness  measure.  The 
genetic  algorithm  could  be  run  several  times  using  different  fitness  measures,  but  this  carries  a 
high  cost  in  both  processing  time  and  post-processing  analysis  effort.  Likewise,  we  have  seen 
from  the  preceding  section  that  reliance  on  any  single  measure  carries  with  it  the  possibility  of 
statistical  misinterpretation.  Two  paths  were  examined  in  this  research  to  address  this  problem, 
although  we  note  that  there  may  be  many  other  possible  solutions. 

•  Modified J-measure. Refer again to Figure #4 and the four test characteristics
[PV(+), PV(-), sensitivity, and specificity]. Our first approach was to create a
measure which was suitably large when any of these four measures was large and
suitably low when none of the measures was relatively large, in effect an aggregate
fitness measure. It should be noticed from the foundation we have laid that if both a
and d are relatively large when compared with b and c, the four test characteristics
are all relatively large. This would demonstrate that the risk factors and/or exposures
under investigation are highly successful in predicting the outcome symptoms and/or
diagnoses under investigation. Tentatively we will select the following formula as
our fitness measure:

\[ \mathrm{mod\_j}(\mathrm{fitness}) = \frac{a \times d}{b \times c} \]

It may also be noticed that this measure will effectively indicate if the outcome
symptoms/diagnoses are successful at predicting the risk/exposures. We call this
property "reverse confidence." It is particularly helpful to examine the two sets of
attributes with each assuming the role of dependent and independent variables
simultaneously. Finally, recall that unlike the evaluation of clinical tests, CCEP
analysts consider it interesting if both false positive and false negative values are
simultaneously high (indicating a risk/exposure combination reduces the probability
of a symptom/diagnosis combination). To account for this situation, our j-measure is
modified as follows:

\[ \text{if } \frac{a \times d}{b \times c} > 1: \quad \mathrm{mod\_j} = \frac{a \times d}{b \times c} \]

\[ \text{if } \frac{a \times d}{b \times c} < 1: \quad \mathrm{mod\_j} = \frac{b \times c}{a \times d} \]

(Figure #6 gives an example of a modified j-measure calculation; note we use a
natural log function to shape the fitness function for better genetic competition; this
will be discussed in Chapter V):

                                Fatigue
                        "yes"                   "no"

Uranium     "yes"       a = 11                  b = 84                    PV(+) = 11/(11+84) = 11.6%
Exposure
            "no"        c = 146                 d = 7505                  PV(-) = 7505/(146+7505) = 98.1%

                        Sensitivity             Specificity
                        11/(11+146) = 7.0%      7505/(84+7505) = 98.9%

mod j-measure = 1 + ln[(a*d)/(b*c)]
              = 1 + ln[(11*7505)/(84*146)] = 2.91

Figure 6. Modified J-measure Calculations


•  Chi-square. Another approach to the question of fitness function may be derived
strictly from statistics. Since our aim is to identify risk factors and/or exposures that
are highly associated with symptom and/or diagnoses groups, we may use a
statistical principle which measures the independence (not the same as the term
"independent variable" used in knowledge discovery science to denote the LHS
variables) of two groups of attributes. According to Walpole et al., "The chi-square
test procedure... can also be used to test the hypothesis of the independence of two
variables of classification." (Walpole et al., 1988, pp. 343 - 346) The same
"contingency table" used by epidemiologists may be constructed and used to
compute expected levels of a, b, c, and d based on the joint probability function of
the dependent and independent variables. (See Figure #7) Observed values are the
original values of a, b, c, and d, and expected values are calculated using the
following formula:

\[ \mathrm{Estimated\_Expected\_Value} = \frac{(\mathrm{column\_total}) \times (\mathrm{row\_total})}{\mathrm{grand\_total}} \]

The chi-square is now calculated and summed for all cells in the matrix. (Chi-square
may be used for any size matrix; in this case a two-by-two matrix was used for simplicity. Since a
two-by-two matrix is used in the example, the formula below contains the Yates
Correction, which is not necessary in larger matrices.) A higher chi-square
indicates a higher level of dependence (or lack of independence) between the two
attribute sets. The Chi-square formula (with Yates correction) follows, where o_i is the
observed value and e_i the expected value of cell i; example chi-square
calculations are included in Figure #7:

\[ \chi^2 = \sum_i \frac{(|o_i - e_i| - .5)^2}{e_i} \]

                                      Fatigue
                            "yes"                 "no"                    Row total

Depleted       "yes"        a = 11 (1.93)         b = 84 (93.07)          95
Uranium
Exposure       "no"         c = 146 (155.07)      d = 7505 (7495.93)      7651

Column total                157                   7589                    7746

(Expected values are shown in parentheses.)

chi-square(a) = (|11 - 1.93| - .5)^2 / 1.93 = 38.05
chi-square(tot) = 39.32

Figure 7. Chi-square Calculations
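
The modified j-measure described above, with the natural-log shaping shown in Figure #6, can be sketched as follows. This is an illustrative sketch only: the function name `mod_j` is hypothetical, applying the log after the inversion step is an assumption, and the source does not say how empty cells are handled, so this version simply rejects them.

```python
import math


def mod_j(a: int, b: int, c: int, d: int) -> float:
    """Modified j-measure: 1 + ln of the cross-product ratio, inverted when < 1."""
    if min(a, b, c, d) <= 0:
        # Assumption: zero counts are rejected; the source text is silent on this case.
        raise ValueError("all four cell counts must be positive")
    ratio = (a * d) / (b * c)
    if ratio < 1:
        # High false positives and false negatives (a contraindicative
        # association) should score as high as a positive association.
        ratio = 1 / ratio
    return 1 + math.log(ratio)


if __name__ == "__main__":
    # Counts from Figure 6: depleted uranium exposure versus fatigue.
    print(round(mod_j(11, 84, 146, 7505), 2))
```

Because of the inversion, swapping the roles of the "yes" and "no" columns leaves the score unchanged, which is the symmetry the text asks for.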

The modified j-measure has been used by this research to date; however, a new statistical analysis
package designed to analyze using chi-square is currently being constructed. A more
straightforward formula for chi-square will actually be used in the new statistical analysis package
(Dixon and Massey, 1969, pp. 242 - 243), where N = a + b + c + d is the grand total:

\[ \chi^2 = \frac{\left(|ad - bc| - \tfrac{1}{2}N\right)^2 N}{(a + b)(a + c)(b + d)(c + d)} \]
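
As a check on the two chi-square forms, the sketch below computes the Yates-corrected statistic both cell by cell and with the shortcut formula. The function names are hypothetical and this is not the statistical package described in the text. With unrounded expected values both forms give about 39.46 for the Figure #7 counts; the 39.32 reported in the figure appears to come from expected values rounded to two decimals.

```python
def chi_square_yates_cells(a: int, b: int, c: int, d: int) -> float:
    """Sum (|observed - expected| - 0.5)^2 / expected over the four cells."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected
    return total


def chi_square_yates_shortcut(a: int, b: int, c: int, d: int) -> float:
    """Shortcut form: N (|ad - bc| - N/2)^2 / product of the four marginal totals."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    return numerator / ((a + b) * (a + c) * (b + d) * (c + d))


if __name__ == "__main__":
    counts = (11, 84, 146, 7505)  # Figure 7 counts
    print(chi_square_yates_cells(*counts), chi_square_yates_shortcut(*counts))
```

Both functions assume no marginal total is zero.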


III.  SOLUTION CONCEPTS


A.       RESEARCH  GOALS 

In  the  case  of  the  Desert  Storm  research,  years  of  conventional  medical  research  have 
yielded no single syndrome or associated symptomatology set. This means that no fixed
dependent  variable  set  (combinations  of  diagnoses  and/or  reported  standard  symptoms)  can  be 
readily  identified.  The  traditional  epidemiological  paradigm  is  to  isolate  a  group  of  individuals 
with  consistent  symptoms/outcome  diagnoses  and  then  find  what  key  demographic  or  exposure 
elements these individuals share. If related demographic/exposure data are present, they are used to
focus  clinical  research  on  an  underlying  cause.  This  approach  has  not  proven  fruitful  to  date, 
either  because  no  syndrome  exists  or  because  the  sheer  volume  of  data  in  the  CCEP  database 
hides  a  relation  of  interest  from  human-controlled  querying.  Therefore,  we  have  chosen  to  let 
technology  simplify  the  problem  from  the  outset  of  the  knowledge  discovery  process. 

As  mentioned  before,  there  are  four  basic  categories  of  useful  data  contained  in  the 
CCEP  database  {demographics,  reported  exposures,  reported  standard  symptoms,  and  outcome 
diagnoses}.  While  attributes  in  each  category  could  prove  useful  as  independent  (LHS)  or 
dependent  (RHS)  variables,  it  is  doubtful  that  attributes  from  the  same  category  will  be  useful  as 
both  LHS  and  RHS  simultaneously.  The  research  question  is  now  simplified  to  an  examination 
of  which  attributes  (or  combinations  of  attributes)  in  each  category  are  most  highly  associated 
with  (or  statistically  dependent  on)  which  attributes  from  another  major  data  category. 

EXAMPLE  What  associative  relationships  exist  between  exposure  attributes  and 
outcome  diagnosis  attributes?  Based  on  analysis,  there  is  a  high  association  between 
reported  exposure  to  Scud  Attack  and  Depleted  Uranium  and  an  outcome  diagnosis  of 
Post-traumatic  Stress  Disorder.  [This  is  just  an  example,  not  an  actual  finding] 

This exponentially increases the size of the prospective search space, which is represented by
$2^{\#LHS} \times 2^{\#RHS}$ (where #LHS = number of independent fields, #RHS = number of dependent
fields, and all attributes are Boolean; if not, the search space is even greater). The increase in
search space can provide useful insight to medical researchers as they develop hypotheses.
Instead  of  waiting  for  medical  researchers  to  provide  a  more  structured  problem  (and  thereby 
reduce  the  search  space),  it  was  our  feeling  that  an  intelligent  search  technique  could  be 
employed  effectively  in  the  problem  as  given.  Therefore,  the  role  of  our  genetic  algorithm  is  to 
test  an  extremely  large  subset  of  all  fields  in  the  CCEP  database  concurrently  for  levels  of 
interest  based  on  a  specific  model  of  epidemiological  interest,  to  wit: 

\[ \Theta(LHS^*, RHS^*) = \max\bigl(\Theta(LHS', RHS')\bigr) \]

where $LHS' \subset LHS^*$ and $RHS' \subset RHS^*$ and $\Theta()$ = fitness function
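
To make the size of this search concrete, the sketch below counts the attribute-subset pairs and performs the maximization exhaustively, which is only feasible for a handful of fields. All names here are hypothetical, the fitness function is supplied by the caller, and the enumeration skips empty subsets (which the $2^{\#LHS} \times 2^{\#RHS}$ count includes).

```python
from itertools import combinations


def search_space_size(n_lhs: int, n_rhs: int) -> int:
    """Number of (LHS subset, RHS subset) pairs when every field is Boolean."""
    return 2 ** n_lhs * 2 ** n_rhs


def nonempty_subsets(fields):
    """Yield every non-empty subset of fields as a tuple, smallest first."""
    for k in range(1, len(fields) + 1):
        yield from combinations(fields, k)


def exhaustive_best(lhs_fields, rhs_fields, fitness):
    """Return the (lhs, rhs) pair with the highest fitness by brute force."""
    candidates = (
        (lhs, rhs)
        for lhs in nonempty_subsets(lhs_fields)
        for rhs in nonempty_subsets(rhs_fields)
    )
    return max(candidates, key=lambda pair: fitness(*pair))


if __name__ == "__main__":
    # Even a modest 40 + 40 split gives 2**80 pairs, far beyond brute force.
    print(search_space_size(40, 40))
```

The brute-force routine is shown only as a baseline; its cost doubles with each added field, which is the motivation for the guided search described in the text.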

We  did  count  on  CCEP  medical  researchers  to  define  their  concept  of  "interesting"  and 
thereby  guide  our  selection  of  an  appropriate  fitness  function.  This  fundamental  shift  in 
knowledge  discovery  technique  suggests  that  a  genetic  algorithm  may  be  used  to  provide 
researchers  with  information  to  assist  them  in  framing  the  initial  research  strategy,  instead  of 
framing  the  problem  and  then  passing  it  to  a  genetic  algorithm.  We  asked  the  following  question, 
"If  a  syndrome  does  exist  and  the  data  necessary  to  identify  it  are  contained  in  the  CCEP 
database,  what  data  relationships  would  it  create  in  the  CCEP  database?"  The  answer  to  this  was 
converted  to  a  mathematical  fitness  measure.  The  resulting  combinations  of 
exposures/demographics  and  symptoms/diagnoses  discovered  will  contain  any  identifiable 
syndromes, but not every hypothesis in the set is guaranteed to be a useful solution. The
goal  is  to  present  medical  researchers  with  a  more  workable  solution  space  in  which  to  focus 
their  conventional  research  efforts.  This  approach  shifts  the  burden  of  searching  a  tremendous 
alternative  space  appropriately  onto  the  genetic  algorithm. 


B.       SOLUTION STRATEGY

Our  solution  strategy  takes  two  forms,  theoretical  and  practical.  In  the  theoretical  sense, 
the  solution  strategy  rests  on  selection  of  the  most  efficient  method  of  searching  an  extremely 
large  solution  space.  There  are  three  basic  methods  of  search: 

•  Random.  In  this  type  of  search,  a  computer  program  will  randomly  generate 
hypotheses  and  pass  these  hypotheses  to  an  evaluating  routine.  The  evaluating 
routine  assigns  a  fitness  measure  to  each  hypothesis  based  on  the  fitness  function 
provided. If the hypotheses are generated sequentially, this method is also known as
"brute  force."  This  method  tests  many  hypotheses,  because  the  hypothesis 
generation  apparatus  is  extremely  simple,  but  has  no  capacity  to  self-improve  or  tune 
the  search  to  the  operator's  goals. 

•  Human-controlled  Selective  Search.  In  this  case,  a  human  formulates  a  hypothesis 
and  translates  it  into  the  form  of  a  query.  The  query  is  evaluated  by  the  computer 
system  and  the  results  are  returned  to  the  human  operator.  It  is  assumed  that  the 
human operator draws upon practical knowledge of the problem and the results of
prior queries to formulate new queries. Therefore, the quality of query formulation
improves  throughout  the  process.  This  allows  the  search  to  self-improve  (including 
the  human  operator  within  the  boundary  of  the  search  system)  and  obviously  tune  to 
the  operator's  goals.  However,  the  hypothesis  generation  is  extremely  slow. 

•  Systematic,  Intelligent,  Automated  Search.  A  computer  program  (genetic 
algorithm)  generates  hypotheses,  passes  them  to  an  automated  evaluator,  receives 
results, and then re-generates a new set of hypotheses (systematically adapting its
search based on its past performance as indicated in the results received). This
technique demonstrates all three desirable search characteristics: fast hypothesis
generation, self-improvement, and tuning to the operator's goals.

Figure #8 illustrates the comparative advantages of each search technique. It should now be
clear, from a theoretical point of view, why a systematic, intelligent, automated search
(genetic algorithm) has been chosen.

Figure 8. Characteristic of Different Search Techniques

Now  let  us  discuss  the  solution  strategy  on  a  more  practical  level.  Assume  for  a  moment  that  a 
genetic  algorithm  performs  a  systematic,  intelligent  search  as  theorized.  The  next  section  will 
provide a theoretical basis for this assumption. From Section II.D.4, we draw the premise that a
syndrome will manifest itself as a high association between a specific combination of
demographic and/or exposure attributes and a finite set of symptomatology or diagnoses.
Combine this with the premise that either a modified j-measure or chi-square formula will indicate
the level of association (or dependence) between two sets of attributes. Our strategy is then to
instruct the genetic algorithm (DaMI) to find the most significant associations between
demographics/exposures and symptoms and between demographics/exposures and diagnoses.
These two analyses will divide the complete set of possible combinations of
demographics/exposures  into  three  categories  (note  that  demographics/exposures  are  traditionally 
viewed  as  the  independent  attribute  set): 

•  Demographic/Exposure combinations which appear on neither analysis. Any
hypothesis not contained in either analysis indicates that there is no statistical basis
within the CCEP database to indicate that combination is a possible syndrome. This
does not mean that it could not suggest a syndrome; as stated before, the CCEP
database may not capture the appropriate data to identify the hypothesis as a
syndrome.

•  Demographic/Exposure  combinations  are  associated  with  both  specific 
combinations  of  symptoms  and  specific  combinations  of  diagnoses.  This  is  the 
ideal  case  for  suggesting  the  existence  of  a  syndrome.  It  indicates  that  a  group  of 
PGW  participants,  sharing  both  a  common  symptomatology  and  outcome  diagnosis 
set  belong  to  the  demographic  profile  and/or  report  common  exposure  elements. 
Clinical  research  should  be  directed  toward  a  prospective  syndrome  demonstrating 
the  listed  symptoms  and  diagnoses.  Again  this  indicates  that  a  hypothesis  meets  the 
mathematical  definition  of  interesting,  but  the  possibility  of  it  being  a  syndrome  can 
only  be  confirmed  by  evaluation  by  medical  professionals. 

•  Demographic/Exposure  combinations  are  associated  with  either  specific 
combinations  of  symptoms  or  diagnoses.  A  majority  of  hypotheses  identified  by 
DaMI  will  fall  into  this  category.  If  only  one  correlation  is  made  with  the 
demographic/exposure  data,  there  is  a  weaker  indication  that  this  particular 
combination  signals  a  candidate  syndrome.  However,  failure  to  appear  on  both 
analyses  should  not  completely  discount  the  hypothesis.  As  mentioned  before,  the 
failure  of  the  CCEP  database  to  capture  all  symptomatology  or  diagnoses  may 
explain  the  appearance  of  the  demographic/exposure  combination  on  only  one 
analysis.  Therefore,  hypotheses  in  this  category  should  still  be  evaluated  by  medical 
professionals. 
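
The three categories above amount to simple set operations on the results of the two analyses. The sketch below is illustrative only; `categorize` and the attribute names in the usage are hypothetical, and each demographic/exposure combination is represented as a `frozenset` of attribute names.

```python
def categorize(all_combos, symptom_hits, diagnosis_hits):
    """Split demographic/exposure combinations by which analyses reported them.

    symptom_hits:   combinations associated with symptom sets
    diagnosis_hits: combinations associated with diagnosis sets
    """
    both = symptom_hits & diagnosis_hits
    one_only = (symptom_hits | diagnosis_hits) - both
    neither = set(all_combos) - symptom_hits - diagnosis_hits
    return {"neither": neither, "both": both, "one_only": one_only}


if __name__ == "__main__":
    # Made-up combinations for illustration; not actual findings.
    a = frozenset({"botulism_vaccine", "depleted_uranium"})
    b = frozenset({"scud_attack"})
    c = frozenset({"gender"})
    print(categorize({a, b, c}, symptom_hits={a, b}, diagnosis_hits={a}))
```

In practice the full set of combinations is far too large to enumerate, so the "neither" category would be left implicit rather than materialized as it is here.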

Naturally,  a  certain  degree  of  ambiguity  exists  concerning  the  specific  fitness  measurement 
thresholds  with  respect  to  interest  (filtering).  Filtering  will  be  discussed  in  Chapter  VI.  But  in  a 
practical  sense,  this  analysis  will  provide  medical  researchers  with  a  prioritized  list  of  interesting 
associations. The central point is that most possible hypotheses will prove statistically
implausible and therefore fall into the first category, suggesting they not receive costly
conventional medical research efforts.

Finally,  many  initial  DaMI  discovery  sessions  were  devoted  to  analyzing  relationships 
between  reported  symptoms  and  outcome  diagnoses.  Early  input  from  CCEP  epidemiologists 
included  a  strong  desire  to  identify  unexpected  symptom/diagnosis  combinations.  This  study 
was  appealing  for  initial  research  because  all  attributes  involved  were  Boolean  (as  opposed  to 
demographic  and  exposure  attributes  having  more  than  two  possible  values).  The  research 
proved  statistically  successful  (discussed  in  Chapter  VI)  but  of  limited  practical  value  to  CCEP. 


IV.  DaMI GENETIC ALGORITHM ARCHITECTURE


Up  to  this  point,  this  thesis  has  focused  on  the  theoretical  structuring  of  the  CCEP 
research  problem  and  formulating  the  qualities  of  a  genetic  algorithm  required  to  solve  the 
problem.  The  second  half  of  this  thesis  will  focus  on  describing  the  tool  developed  to  meet  these 
challenges  and  the  success  of  that  tool  in  actual  analysis.  Based  on  the  preceding  discussion,  the 
genetic  algorithm  must  be  specifically  designed: 

•  to accept an unstructured set of dependent and independent variables

•  to efficiently search an extremely large search space

•  to employ adaptive learning, where a priori information is used to guide future
hypothesis testing

This  chapter  will  deal  with  DaMI  from  a  macro  systems  perspective;  Chapter  V  will  address  the 
details  of  the  system's  design. 


A.       PROGRAM  MODULES 

Unlike many other genetic algorithms, the system designed for this research (DaMI) has
been built using several independent modules. These modules consist of the genetic algorithm itself, a
statistical  package,  a  user  interface,  and  a  verification  package.  There  were  two  primary  reasons 
for  this  design  strategy.  The  first  was  to  relieve  the  genetic  algorithm  of  the  mundane  analysis 
tasks,  results  filtering,  and  user  interface  tasks,  thereby  enhancing  the  space  searching  efficiency. 
The  second  reason  was  to  aid  in  system  development.  By  adopting  a  modular  development 
approach,  a  great  deal  of  effort  can  be  focused  on  the  core  genetic  algorithm  technology  and 
allow  the  system  to  begin  rapid  prototyping  before  optimal  statistical  analysis  and  user  interface 
modules  were  developed.  Once  the  core  genetic  algorithm  is  properly  functioning,  more  robust 
statistical engines and user options may be added, using experience gained from test runs. A
more in-depth explanation of the genetic algorithm (GA) operation is contained in the next
chapter. Figure #9 shows the relationship between the DaMI modules.

Figure 9. Relationship of DaMI Modules


1.         The  Genetic  Algorithm  Package 

The  genetic  algorithm  package  is  responsible  for  maintaining  a  list  (population)  of 
hypotheses  (rules)  in  the  current  generation,  selecting  the  most  successful  rules,  and  performing 
the  genetic  operations  of  reproduction,  crossover,  and  mutation.  These  genetic  operators  allow 
the  system  to  adapt  the  analysis  to  the  goal  model  (fitness  function)  and  improve  the  search 
hypotheses  as  each  generation  is  processed.  In  this  thesis,  "hypothesis"  and  "rule"  are  used 
interchangeably; "hypothesis" is a medical research term and "rule" is an artificial intelligence
term.  Clearly,  not  all  possible  hypotheses  will  be  tested  (hence  the  advantage  of  the  genetic 
algorithm),  but  the  use  of  genetic  operators  ensures  that  the  rules  being  tested  have  the  highest 
probability  of  satisfying  the  given  fitness  function  (Holland,  1975).  In  the  DaMI  system,  the 
genetic  algorithm  stores  hypotheses  as  combinations  of  attributes  only,  not  as  combinations  of 
attributes and specific values. Competition is based on success of attribute sets as a whole.
Attribute  sets  (like  gender,  receiving  the  botulism  vaccine,  exposure  to  uranium  [independent 
variables]  and  Depression  and  Chronic  Fatigue  Syndrome  [dependent  variables])  are  passed  to 
the  statistical  package,  which  returns  an  aggregate  fitness  value  for  all  possible  value 
combinations  of  those  attributes.  The  statistical  package  is  called  recursively  during  the 
processing  of  a  single  generation  for  every  rule,  until  the  entire  generation  is  evaluated.  Then  the 
genetic  algorithm  produces  the  next  generation  and  the  process  is  repeated. 
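
The generation loop described above can be sketched with a generic simple genetic algorithm. This is not DaMI's implementation: the selection scheme (fitness-proportionate), single-point crossover, the parameter defaults, and the name `evolve` are all assumptions for illustration. Each rule is a tuple of booleans marking which attributes it includes, and the caller's `fitness` function stands in for the statistical package.

```python
import random


def evolve(n_attributes, fitness, pop_size=20, generations=30,
           crossover_rate=0.7, mutation_rate=0.02, seed=0):
    """Return the fittest rule seen; a rule is a tuple of booleans, one per attribute."""
    rng = random.Random(seed)
    population = [tuple(rng.random() < 0.5 for _ in range(n_attributes))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Reproduction: fitter rules are more likely to be copied forward.
        # Assumes fitness values are non-negative.
        weights = [fitness(rule) + 1e-9 for rule in population]
        parents = rng.choices(population, weights=weights, k=pop_size)
        children = []
        for i in range(0, pop_size - 1, 2):
            p1, p2 = parents[i], parents[i + 1]
            # Crossover: swap the tails of two parents at a random cut point.
            if n_attributes > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_attributes)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            children += [p1, p2]
        if len(children) < pop_size:  # odd population size
            children.append(parents[-1])
        # Mutation: occasionally flip whether an attribute is in the rule.
        population = [tuple(bit != (rng.random() < mutation_rate) for bit in rule)
                      for rule in children]
        best = max(population + [best], key=fitness)
    return best
```

A caller would pass a fitness function that maps a rule to a score such as the modified j-measure; fixing `seed` makes a run repeatable, which is convenient when comparing parameter settings.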

2.         The  Statistical  Analysis  Package 

The  statistical  analysis  package  receives  a  set  of  independent  and  dependent  attributes  to 
evaluate  from  the  genetic  algorithm  package.  The  statistical  package  requires  no  information 
other  than  a  list  of  field  names  to  evaluate.  The  number  of  attributes  in  each  request  sent  to  the 
statistical  package  varies,  so  it  must  be  capable  of  processing  loosely  bounded  problems. 
During  pre-processing,  the  analysis  database  (database  under  analysis;  in  this  case  the  CCEP 
Persian  Gulf  War  Database)  is  examined  and  a  table  is  created  of  all  attributes  and  their  possible 
values.  This  table  is  used  as  the  source  for  generating  each  individual  query  (there  are  many 
individual queries generated to answer each request from the genetic algorithm) and ensuring that
each possible combination is tested exactly once. The statistical package then computes the
fitness  of  each  possible  attribute/value  combination.  An  aggregate  fitness  measure  is  then 
computed  and  returned  to  the  genetic  algorithm  package.  As  the  statistical  package  tests 
attributes  against  the  database  under  analysis,  it  also  performs  a  test  of  each  attribute/value 
combination  against  a  second  database.  This  second  test  is  not  returned  to  the  genetic  algorithm 
and  therefore  does  not  affect  hypothesis  competition.  This  value  is  stored  to  be  used  later  for 
results validation (see section V.D.3).

3.         User Interface


The  user  interface  controls  interaction  between  DaMI  and  the  system  operator.  The  user 
interface  allows  the  user  to  adjust  tunable  parameters  (discussed  in  Chapter  V),  view  the 
discovery  database  at  various  stages  of  processing,  and  start  and  reset  the  genetic  algorithm 
package.  The  user  interface  also  provides  intermediate  feedback  to  the  user  during  DaMI 
operation.  It  was  designed  using  the  Foxpro  Screen  Design  Wizard  and  is  controlled  by  push 
buttons  and  pop-up  menus.  Settings  may  not  be  adjusted  "on-the-fly"  when  the  genetic 
algorithm  is  operating.  An  example  of  the  user-interface  screen  is  shown  in  Figure  #10  below. 
The user-interface module is disposable, and therefore an in-depth discussion of the
user-interface design is not included in this thesis.

Figure 10. DaMI User Interface

B.       REPORTING AND FILTERING

Once  a  discovery  session  has  been  completed  by  DaMI,  several  files  are  created.  A 
transcript  of  each  hypothesis  individual  (at  the  attribute  level)  of  every  generation  is  created  as 
DaMI  operates,  along  with  a  transaction  record  of  each  genetic  operation  employed,  the  source 
(parent) rules, and resulting offspring. The transaction record also maintains a time stamp at the
start of each generation, which can be used to monitor processing speed. DaMI also records how
many actual combinations were tried during the session. These files will not be discussed in
detail  (file  structures  are  contained  in  Appendix  B). 

The most important file created (rulelib.dbf) contains a list of every hypothesis tested and
used to determine an aggregate fitness measure (without duplication). Several key points must
be clarified at this juncture. First, not every possible attribute/value combination is used to
compute the aggregate fitness value of a given attribute set (this is a tunable parameter). Second,
rulelib.dbf stores attribute and value combinations (as opposed to the session transcript, which
records only the higher-level attribute sets). It also contains the intermediate, final, and
verification fitness measures. This makes rulelib.dbf the actual answer produced by DaMI.
Figure #11 is an excerpt from rulelib.dbf.

Figure 11. Rulelib.dbf Display

Finally, whatever fitness measure is used will probably not have an absolute threshold of
"interest." A fitness measure is useful only for ranking the relative interest of the hypotheses tested;
therefore some form of filtering must be done prior to reporting. However, it is inadvisable to
enforce that filter during operation. Instead, rulelib.dbf is left in the most robust
(non-summarized) form practical; filtering is performed ad hoc, using an SQL-type query language, on
a case-by-case basis for each report.

Several  reports  have  been  developed  in  Foxpro  for  the  DaMI  system.  However,  as  with 
filtering,  reports  are  tailored  to  suit  the  needs  of  each  individual  recipient.  Summary  reports  are 
created  on  an  ad-hoc  basis;  there  is  a  standard  detailed  report  which  contains  hypotheses  and  all 
intermediate  and  final  statistical  computations.  The  detailed  reports  (two  main  studies  were 
conducted in this thesis) of the top 100 hypotheses discovered are contained in Appendix C.

C.       SYSTEM REQUIREMENTS

1.         Hardware  and  Software  Requirements 

From  the  outset,  the  author's  goal  was  to  construct  a  research  tool  and  methodology  that 
can  be  employed  by  researchers  in  their  community,  without  the  need  for  a  laboratory  of  (scarce) 
high-power  computer  assets.  In  any  case,  it  has  already  been  shown  that  raw  processing  power  is 
quickly  overcome  by  large  unstructured  database  analysis  requirements.  Therefore,  a  genetic 
algorithm  is  used  to  intelligently  enhance  the  processing  capabilities  of  whatever  platform  it  runs 
on.  In  keeping  with  this  goal,  DaMI  was  designed  to  operate  on  a  standard  personal  computer 
using inexpensive commercial software. The hardware and software required to run
DaMI are listed below:

Hardware  Requirements 

Personal Computer, 80486/66 MHz processor or better

8 Megabytes of RAM

200 Megabytes of free hard disk storage

Software  Requirements 

Microsoft®  Visual  Foxpro  version  3.0 

Microsoft®  Windows  version  3.xx  or  Windows  95 

Surpassing  the  minimum  hardware  requirements  will  of  course  benefit  system  performance.  The 
most  dramatic  performance  improvements  will  be  realized  by  increasing  RAM  and  the  access 
speed of the PC hard drive.

2.         Processing Limits

DaMI  is  primarily  limited  by  the  time  available  to  the  user  to  complete  the  analysis; 
however,  there  are  some  processing  limitations.  For  the  preservation  of  system  speed,  DaMI 
maintains  the  active  population  in  a  RAM-based  array.  Therefore,  it  is  limited  by  the  maximum 
array  size  allowed  in  Foxpro.  The  required  array  size  is  a  function  of  population  size  per 
generation  and  number  of  attributes  under  analysis.  The  formula  for  this  metric  is: 

population size × analysis fields < 73,500

Under this limitation, analysis of 70 fields with a population size of 15,000 (array size 1,050,000)
would exceed the system limits. Only the number of fields actually under analysis is used in this
calculation, not the number of fields in the database being analyzed. Also, the number of records
in the analysis database is limited only by the maximum Foxpro table size (maximum records
per table file = 1 billion, maximum size of a table file = 2 gigabytes, maximum fields per record
= 255). Naturally, larger files will take longer for the statistical package to analyze.
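
The array limit above can be checked before a discovery session is started. The following Python sketch is illustrative only (DaMI itself is written in Foxpro); the function names are hypothetical, and the constant is simply the bound quoted in the formula above.

```python
# Bound quoted in the text: population size x analysis fields < 73,500.
MAX_ARRAY_CELLS = 73_500


def array_cells(population_size: int, analysis_fields: int) -> int:
    """Number of array cells needed to hold one generation in RAM."""
    return population_size * analysis_fields


def fits_in_array(population_size: int, analysis_fields: int,
                  limit: int = MAX_ARRAY_CELLS) -> bool:
    """True when the active population stays strictly below the limit."""
    return array_cells(population_size, analysis_fields) < limit


def max_population(analysis_fields: int, limit: int = MAX_ARRAY_CELLS) -> int:
    """Largest population size that keeps the product strictly below the limit."""
    return (limit - 1) // analysis_fields
```

For the example in the text, `array_cells(15000, 70)` gives 1,050,000 cells, so `fits_in_array(15000, 70)` is false.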

V.  SEARCHING THE HYPOTHESIS SPACE: DaMI IMPLEMENTATION


A.       THE  GENETIC  ALGORITHM 

The basic architecture of the DaMI Genetic Algorithm is based on (Goldberg, 1986),
with the notable exception that our genetic algorithm stores rules as strings of Boolean attributes
("true" = consider the attribute; "false" = don't consider the attribute). This allows the genetic
algorithm to process simple binary strings, as opposed to strings of field values and wildcards
(Goldberg uses a "*" to denote that any value of the attribute is acceptable). This does not imply that
the genetic algorithm is simplistic; in fact, competition of attributes in aggregate actually provides
for a more efficient search of the alternative space. As can be seen in Figure #12, a conventional
genetic algorithm operates on hypotheses as combinations of attributes and values. In our case,
the conventional representation would prevent the genetic algorithm from considering the
associations between risk factors (exposures/demographics) and outcomes (symptoms/diagnoses)
in aggregate. By using the DaMI methodology, risk factor and outcome associations (hypotheses)
are examined comprehensively before competing for selection and genetic recombination.
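
The Boolean representation can be illustrated with a short Python sketch. It is not the DaMI source (which is written in Foxpro); the lower-case field names and the `decode` helper are hypothetical, and the attribute order and the values of Rule 2 are taken from Figure 12.

```python
# Attribute order follows Figure 12; the first six are independent (LHS)
# attributes and the last three are dependent (RHS) attributes.
ATTRIBUTES = ("gender", "service", "uranium", "oil_smoke", "combat",
              "anthrax", "fatigue", "depression", "memory_loss")
LHS_COUNT = 6


def decode(rule):
    """Split a Boolean rule string into the LHS and RHS attributes it selects."""
    selected = [name for name, used in zip(ATTRIBUTES, rule) if used]
    lhs = [name for name in selected if name in ATTRIBUTES[:LHS_COUNT]]
    rhs = [name for name in selected if name in ATTRIBUTES[LHS_COUNT:]]
    return lhs, rhs


# Rule 2 of Figure 12: TRUE means "consider this attribute".
RULE_2 = (True, True, True, False, False, True, False, True, False)
```

Here `decode(RULE_2)` yields the attribute set (gender, service, uranium, anthrax) on the left-hand side and (depression) on the right-hand side; the specific values of those attributes are left to the statistical package.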

Conventional Genetic Algorithm Representation (Goldberg, 1989)

          Demographics        Reported Exposures                       Outcome Diagnoses
Rule      Gender   Service    Uranium  Oil Smoke  Combat   Anthrax     Fatigue  Depression  Memory Loss
1         male     Navy       Yes      *          *        No          *        Yes         *

Rule 1 indicates a relationship between Male Navy personnel who reported exposure to Uranium but not
Anthrax and an outcome diagnosis including Depression

DaMI Genetic Algorithm Representation

          Demographics        Reported Exposures                       Outcome Diagnoses
Rule      Gender   Service    Uranium  Oil Smoke  Combat   Anthrax     Fatigue  Depression  Memory Loss
2         TRUE     TRUE       TRUE     FALSE      FALSE    TRUE        FALSE    TRUE        FALSE

Rule 2 indicates a relationship between gender, service, reported exposure to uranium and/or
Anthrax and whether or not the patient was diagnosed with Depression

Figure  12.  Conventional  and  DaMI  Algorithm  Representations 

This  genetic  algorithm  uses  a  "roulette  wheel"  (Goldberg,  1989)  model  for  competitive 
selection  with  the  size  of  each  rule's  "slice"  (or  probability  of  selection)  being  directly 
proportional  to  the  fitness  measure  (determined  by  the  statistical  package)  of  each  rule.  Slices  are 
selected  for  reproduction,  crossover,  and  mutation  randomly,  but  the  "size"  of  each  slice  gives  a 
proportionally  higher  chance  of  survival  to  rules  with  higher  fitness.  As  individual  rules  show 
reproductive dominance, these individuals may possess more than one slice on the roulette
wheel (i.e., a particularly strong rule may reproduce more than once per generation, giving it
more than one slice on the subsequent generation's roulette wheel). We chose the roulette wheel
(Goldberg, 1989) because it allows the stronger rules to dominate more quickly than with other
methods (e.g., rank or tournament) and thereby converge faster. The basic genetic operators
(reproduction, crossover, and mutation) are all implemented in DaMI, with operator-adjustable
profiles (see section V.C).
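
Roulette wheel selection as described above can be sketched in a few lines of Python. This is the textbook fitness-proportionate scheme, offered as an illustration rather than as the DaMI code; the function name and the explicit random number generator argument are assumptions.

```python
import random


def roulette_select(fitnesses, rng):
    """Return the index of one rule, chosen with probability proportional
    to its fitness (the size of its slice on the wheel)."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("at least one rule must have positive fitness")
    spin = rng.random() * total          # where the wheel stops
    running = 0.0
    for index, fitness in enumerate(fitnesses):
        running += fitness               # far edge of this rule's slice
        if spin < running:
            return index
    return len(fitnesses) - 1            # guard against floating-point round-off
```

With fitness values of 1 and 3, the second rule is expected to be selected about three times as often as the first.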

B.       THE STATISTICAL ANALYSIS ALGORITHM

The  DaMI  statistical  package  in  use  is  a  fairly  simple  algorithm.  The  modular  design  of 
our  system  allows  for  the  replacement  of  this  statistical  package  with  a  more  robust  commercial 
package  in  the  future.  At  this  point,  the  cost  of  designing  an  interface  outweighs  potential 
benefits;  this  may  not  be  true  for  more  complex  analysis  projects. 

Given a set of dependent attributes (RHS) and independent attributes (LHS), the
statistical package creates a two-dimensional array of attributes and possible values. The array
also contains the number of possible values for each attribute and a counter for each attribute. As
the statistical algorithm processes each combination, the counter for each attribute is incremented
in a number base equal to that attribute's number of possible values (i.e., if the attribute
"gender" has two possible values, then its counter increments in base 2; if the attribute "state"
has fifty possible values, then its counter increments in base 50). The algorithm uses each
individual attribute's current counter value to reference a cell in the array. The cell values and
attribute names are used to create a textual query statement. The query statement is then applied
to the analysis database and the fitness measure is applied to the result. This allows the same
statistical algorithm to loop over every combination with a minimum amount of software code,
regardless of the number of attributes passed to it by the genetic algorithm.
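
The counter scheme just described behaves like an odometer in which each wheel has its own base. A minimal Python sketch follows; the function name, the dictionary input, and the query syntax are assumptions made for illustration and do not reproduce the Foxpro statements DaMI generates.

```python
def enumerate_queries(values_by_attribute):
    """Build one query condition per attribute/value combination, visiting
    each combination exactly once with a mixed-base counter."""
    names = list(values_by_attribute)
    if not names:
        return []
    bases = [len(values_by_attribute[name]) for name in names]
    if 0 in bases:
        return []                       # an attribute with no values yields no queries
    counters = [0] * len(names)         # one counter per attribute
    queries = []
    while True:
        # Each counter references one cell of the attribute/value table.
        queries.append(" AND ".join(
            f"{name} = '{values_by_attribute[name][count]}'"
            for name, count in zip(names, counters)))
        # Increment the rightmost counter, carrying leftward on overflow.
        position = len(counters) - 1
        while position >= 0:
            counters[position] += 1
            if counters[position] < bases[position]:
                break
            counters[position] = 0
            position -= 1
        if position < 0:                # every counter rolled over: finished
            return queries
```

For two attributes with two and three possible values respectively, the sketch produces six conditions, whatever the number of attributes passed in.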

Several fitness measures have been used (see the discussion in section II.E.4). Our goal,
since medical researchers seek associations between patient risk factors/exposures, reported
symptoms, and resulting diagnoses, is to award the highest fitness values to those LHSs and
RHSs which are most highly interdependent (rather than independent). Since each request from the
genetic  algorithm  generates  many  individual  statistical  package  queries,  some  means  of 
aggregating  the  fitness  measures  of  all  possible  combinations  is  required.  Several  different 
methods  for  determining  the  aggregate  fitness  measure  were  considered.  Obviously,  an  average 
of  all  fitness  measures  for  a  given  attribute  set  is  non-competitive.  In  many  cases,  the  highest 
individual  fitness  measure  has  been  used  because  of  the  specificity  of  the  research  question.  In 
other cases, an aggregate measure may be taken using Chi-square or an average of the top three
or four j-measures (use of an aggregate value limits the awarding of a high fitness measure based
on a single unexpected outlier in the research database).
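
The two aggregation choices discussed above (the single highest measure, or an average of the top few) can be expressed with one parameter. The Python sketch below is illustrative; the function name is hypothetical, and the individual fitness values are assumed to have been computed already by whatever measure is in use.

```python
def aggregate_fitness(measures, top_n=1):
    """Aggregate the fitness of one attribute set from the fitness values of
    its individual attribute/value combinations.

    top_n=1 returns the single best measure; a larger top_n averages the
    best few, which dampens the effect of one unexpected outlier."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    if not measures:
        return 0.0
    best = sorted(measures, reverse=True)[:top_n]
    return sum(best) / len(best)
```

With measures of 1, 2, 3, and 12, the single best value is 12, whereas the average of the top three is about 5.7, so the outlier counts for less.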

A rule cache (like a disk cache, except for hypotheses) is used to prevent duplicate
evaluation of any rule throughout the discovery session. A table of rules evaluated by the
statistical package and resulting fitness values is maintained. Before sending a rule to the
statistical  package,  the  genetic  algorithm  checks  the  table  of  rules  already  evaluated.  If  the  rule 
has  been  previously  evaluated,  the  genetic  algorithm  uses  the  fitness  value  from  the  cache  table. 
If  not,  the  genetic  algorithm  package  sends  the  rule  to  the  statistical  package  and  establishes  a 
new  entry  (with  resulting  fitness)  in  the  cache  table. 
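
The cache logic amounts to a lookup table keyed by the rule. A minimal Python sketch, assuming the rule is a sequence of Booleans and that `evaluate` stands in for the (expensive) call to the statistical package; the class and attribute names are hypothetical.

```python
class RuleCache:
    """Remember the fitness of every rule already evaluated in a session."""

    def __init__(self, evaluate):
        self._evaluate = evaluate    # stand-in for the statistical package
        self._table = {}             # rule -> fitness
        self.evaluations = 0         # number of real evaluations performed

    def fitness(self, rule):
        key = tuple(rule)            # make the rule usable as a dictionary key
        if key not in self._table:
            self._table[key] = self._evaluate(key)
            self.evaluations += 1
        return self._table[key]
```

A rule that reappears in a later generation is then answered from the table without a second pass over the analysis database.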

C.       TUNABLE  PARAMETERS 

The  program  has  several  tunable  parameters  to  adjust  genetic  algorithm  operation. 
Tunable  parameters  are  set  via  the  user  interface  at  the  commencement  of  each  discovery  session. 

•  Crossover probability: probability that a selected rule will exchange information with
another selected rule

•  Mutation probability: probability that a selected rule will undergo a random mutation

prob(reproduction) = 100% - (prob(crossover) + prob(mutation))

•  Population size: number of individual rules in each generation

•  Number of generations: number of generations to simulate

•  Maximum rule complexity: maximum number of dependent and independent attributes
allowed in each hybrid rule (set individually for dependent and independent)

•  Average complexity of initial rule set: average number of dependent and independent
attributes allowed in each rule of the randomly generated initial population

•  Top rules to aggregate: number of rules (in order of decreasing fitness) to use in
computing aggregate fitness by the statistical package
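
The relation between the three operator probabilities can be made concrete with a short Python sketch. The function names are hypothetical, and the way a single random draw is mapped onto the operators is an assumption; only the reproduction formula itself comes from the list above.

```python
import random


def reproduction_probability(crossover, mutation):
    """prob(reproduction) = 1 - (prob(crossover) + prob(mutation))."""
    if crossover < 0 or mutation < 0 or crossover + mutation > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    return 1.0 - (crossover + mutation)


def choose_operator(crossover, mutation, rng):
    """Pick the genetic operator to apply to a selected rule."""
    spin = rng.random()
    if spin < crossover:
        return "crossover"
    if spin < crossover + mutation:
        return "mutation"
    return "reproduction"
```

With the production-run settings reported in the results chapter (crossover 30%, mutation 3.0%), `reproduction_probability(0.30, 0.03)` is 0.67, so about two thirds of selected rules are copied unchanged.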

D.       PROBLEMS AND IMPROVEMENTS

Before  this  discussion  of  DaMI  implementation  is  concluded,  we  would  like  to  discuss 
some  of  the  problems  encountered  in  our  implementation  and  our  solutions  to  these  problems. 
We  found,  as  many  other  researchers  have,  that  genetic  algorithms  are  quite  successful  at 
adaptively  improving  the  quality  of  tested  rules  to  suit  the  provided  fitness  function.  However, 
the  greatest  challenge  has  been  to  ensure  that  our  search  model  adequately  represented  the 
research  questions  (i.e.  the  genetic  algorithm  is  doing  what  it  was  told  to  do,  but  have  we 
provided it with accurate instructions?). Our focus on problems with proper tuning of the genetic
algorithm should in no way detract from the perception that a genetic algorithm is an extremely fast
and effective search technique. It does work as advertised!

1.         Convergence  Issues 

One  challenge  faced  by  our  research  was  to  ensure  that  the  algorithm  would  effectively 
(not  necessarily  physically)  test  the  entire  search  space.  A  genetic  algorithm  will  rapidly 
(especially  using  roulette  wheel  competition)  improve  the  average  fitness  measure  of  rules  within 
successive  generations,  but  in  many  cases,  the  speed  of  improvement  degraded  the  algorithm's 
ability  to  comprehensively  examine  the  search  space. 

It  should  be  recalled  from  genetic  search  theory  (Holland,  1975)  that  search  regret  (or 
missed  rules  of  interest)  is  minimized  if  attributes  of  successful  rules  are  tested  in  exponentially 
more  combinations  in  successive  generations,  and  attributes  of  unsuccessful  rules  are  tested 
exponentially  fewer  times.  This  is  implemented  in  a  genetic  algorithm  by  giving  successful  rules 
a  higher  chance  of  selection  (and  thereby  the  chance  to  mix  information  with  other  successful 
rules)  based  on  the  level  of  their  fitness  measure.  Naturally,  successful  rules  begin  to  dominate 
the  population  (in  our  case  take  up  more  slots  on  the  roulette  wheel)  and  increase  the  chance  that 
their constituent attributes are used for future rules. A problem arises when the fitness measure of
a mediocre rule is disproportionately larger than those of the other individuals of its generation. If this
mediocre rule dominates the population too quickly, then its attributes provide the only material
for future rules. The resulting phenomenon is called premature convergence (Koza, 1988) and
will prevent comprehensive search of the entire space.

Several steps were taken to prevent this, but generally speaking, great care must be used
in selecting a fitness measure. If the slope of fitness in proportion to rule quality is too great,
premature convergence is likely. The author chose to apply a natural logarithm scale to the
fitness measure. This gave a strong relative advantage to good rules over weak rules, but slowed
the domination of good rules (or local maxima) over their slightly weaker peers. The author
also developed a technique called same-parent crossover randomization. In a conventional
crossover, if two identical parents are selected, the resulting "offspring" are duplicates of the
parents. In our crossover operator, if the two parents are the same, a single parent is randomly
bisected into two offspring. Each offspring receives a portion of the parent's genetic material
(attributes) and a portion of randomly generated material. This has no effect on the algorithm at
early stages, but it increases the mutation probability strongly as the population becomes
dominated by a few rules (which causes the crossover operator to lose its ability to effectively
generate new hypotheses; see Figure #13).
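
One possible reading of same-parent crossover randomization is sketched below in Python. It assumes single-point crossover on lists of Booleans and assumes that each offspring keeps one side of the cut from the parent and takes the other side from freshly generated random material; the text does not specify these details, and the function names and the optional `cut` argument are hypothetical.

```python
import random


def random_bits(length, rng):
    """Randomly generated genetic material: one Boolean per attribute."""
    return [rng.random() < 0.5 for _ in range(length)]


def crossover(parent_a, parent_b, rng, cut=None):
    """Single-point crossover with same-parent randomization."""
    length = len(parent_a)
    if cut is None:
        cut = rng.randrange(1, length)   # cut point strictly inside the string
    if parent_a == parent_b:
        # Identical parents: bisect the one parent and complete each half
        # with random material instead of returning two duplicates.
        child_1 = parent_a[:cut] + random_bits(length, rng)[cut:]
        child_2 = random_bits(length, rng)[:cut] + parent_a[cut:]
    else:
        # Distinct parents: ordinary exchange of the tails.
        child_1 = parent_a[:cut] + parent_b[cut:]
        child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2
```

While the population is diverse the randomized branch is rarely taken; once a few rules dominate, identical parents are drawn often and the operator behaves increasingly like mutation.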


[Figure 13 plot. Legend: Cumulative Fitness, Crossover %, Mutation %. Annotation: as the
population prematurely converges, crossover effectiveness decreases and same-parent crossover
increases mutations.]

Figure 13. Effect of Same-parent Crossover Randomization

Finally, it was noted that since a genetic algorithm is based on probabilistic selection,
some extremely strong rules failed to survive (by sheer chance) despite their selective
advantage.  This  is  an  understandable  consequence  of  natural  selection;  sometimes  more  capable 
species  die  solely  because  of  "bad  luck."  The  author  reserved  several  spaces  on  the  roulette 
wheel  for  the  rules  with  the  highest  fitness  measure  in  the  population,  regardless  of  their 
selection  by  the  algorithm.  This  ensures  that  an  extremely  "good"  rule  will  continue  to  be 
available  for  selection  and  recombination  in  successive  generations. 
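
Reserving places for the best rules can be sketched as follows in Python. The number of reserved slots and the way the remaining places are filled are not stated in the text, so both are assumptions here, as are the function names.

```python
def reserve_elite(population, fitnesses, slots):
    """Return the `slots` rules with the highest fitness measure."""
    ranked = sorted(range(len(population)),
                    key=lambda index: fitnesses[index], reverse=True)
    return [population[index] for index in ranked[:slots]]


def next_wheel(offspring, population, fitnesses, slots):
    """Place the reserved rules first, then fill the remaining places with
    offspring so that the population size stays constant."""
    elite = reserve_elite(population, fitnesses, slots)
    remaining = max(0, len(population) - len(elite))
    return elite + offspring[:remaining]
```

The reserved rules therefore remain available for selection even when chance alone would have removed them.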

2.         Processing  Speed  Issues 

However  sophisticated  the  search  technique  may  be,  we  must  still  keep  the  magnitude  of 
this  search  problem  in  mind.  One  of  our  research  goals  was  to  ensure  that  the  technology  created 
did  not  require  sophisticated,  expensive,  or  proprietary  hardware  or  software.  For  this  reason  the 
DaMI application was developed to run on an 80486/66 MHz personal computer using the
Microsoft Windows 3.xx or Windows 95 operating system. (Pentium 166's are used for
production runs.) A very simple problem such as analyzing relations between 15 standard
symptoms and 21 diagnoses (Boolean fields) yields a search space of 69 billion combinations. A
486 computer, using the "brute force" method, can test about 600,000 hypotheses (rules) per day.
At that rate, this problem would take more than 315 years to complete. Even if the speed of
processing could be accelerated by a factor of 100, the problem would still be impractically large.
We  have  processed  runs  involving  exposures/demographics  and  diagnoses  that  were  on  the  order 
of  9.457  *  101  .  Actual  processing  benchmarks  are  included  later  in  the  paper,  but  the  point  for 
the  moment  is  that  results  using  genetic  algorithms  take  days  not  minutes  to  achieve. 
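
The brute-force estimate for the 15-symptom, 21-diagnosis example can be reproduced with a few lines of Python. The sketch assumes, as the figure of 69 billion suggests, that the search space is 2 raised to the number of Boolean fields (2^36 = 68,719,476,736); the function name is hypothetical.

```python
BOOLEAN_FIELDS = 15 + 21                 # standard symptoms + diagnoses
SEARCH_SPACE = 2 ** BOOLEAN_FIELDS       # assumed size of the search space
HYPOTHESES_PER_DAY = 600_000             # brute-force rate quoted for a 486


def brute_force_years(search_space, per_day=HYPOTHESES_PER_DAY, speedup=1):
    """Years needed to test every combination at the given daily rate."""
    days = search_space / (per_day * speedup)
    return days / 365


exact_years = brute_force_years(SEARCH_SPACE)
rounded_years = brute_force_years(69_000_000_000)
faster_years = brute_force_years(SEARCH_SPACE, speedup=100)
```

The exact count gives roughly 314 years and the rounded figure of 69 billion gives slightly more than 315 years; a hundredfold speedup still leaves more than three years for this one simple problem.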

Naturally, the author took several steps to enhance speed on the given PC architecture.
First, the population of rules is maintained in a RAM-based array space, as is the statistical
package's attribute and possible value matrix. This allows the genetic operations to be carried out
with extreme speed. Task complexity is not really a speed issue at all for the genetic algorithm
package; unfortunately, the database under analysis cannot be placed in RAM, so the statistical
package becomes the speed-limiting operation. Genetic operations take several seconds per
population, but the statistical package may take hours to analyze a single, large population. In the
case of the statistical package, the number of attributes and possible values is much more significant
than the number of records in the analysis database. If the operating architecture could be
enhanced to allow the genetic algorithm to pass statistical requests to multiple personal computer
nodes, a significant processing advantage could be attained.

The  nature  of  our  research  question  concerning  a  possible  syndrome  affecting  Persian 
Gulf  War  participants  limits  the  complexity  requirement  of  rules  generated.  In  other  words,  rules 
involving  too  many  attributes  may  be  statistically  significant,  but  are  so  specific  that  they  may 
only  describe  a  single  participant.  Naturally,  these  rules  may  have  a  selective  advantage  over  less 
specific  rules,  because  a  single  outlier  reporting  a  highly  unusual  combination  of  attributes  will 
be  very  highly  rated.  However,  rules  involving  a  single  individual  do  not  suggest  a  syndrome, 
which  by  definition  is  a  series  of  conditions  affecting  a  group  of  individuals.  Therefore,  we 
included a tunable parameter which limits the maximum complexity of rules generated. Rules
involving too many attributes are given a low fitness value and are not sent to the statistical
analysis package. It should be obvious that increasing the number of attributes in a single rule
exponentially increases the complexity of the analysis by the statistical package.
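
Both points in this paragraph, the growth in work per rule and the complexity gate, can be sketched briefly in Python. The function names and the penalty value of zero are assumptions; the default limits of three LHS and two RHS attributes are those reported for the production runs in the results chapter.

```python
from math import prod


def query_count(values_per_attribute):
    """Number of attribute/value combinations the statistical package must
    test for one rule: the product of the attributes' value counts."""
    return prod(values_per_attribute)


def gated_fitness(lhs, rhs, evaluate, max_lhs=3, max_rhs=2, penalty=0.0):
    """Return a low fitness without calling the statistical package when a
    rule exceeds the complexity limits; otherwise evaluate it normally."""
    if len(lhs) > max_lhs or len(rhs) > max_rhs:
        return penalty
    return evaluate(lhs, rhs)
```

For Boolean attributes, five attributes imply 32 combinations and ten imply 1,024, which is the exponential growth referred to above.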

3.         Tuning  the  Fitness  Measure,  Verification,  and  Validation 

One of the greatest challenges faced is to develop a fitness measure that accurately reflects the
requirements of CCEP medical researchers. It is critical that feedback is obtained at every step of
the discovery process.

EXAMPLE  Just  because  there  is  a  high  association  between  hair  loss  and  chronic 
fatigue  syndrome  within  the  database  under  examination  does  not  mean  that  this  is  of 
any  medical  significance. 

It must also be understood that our technique has drastically reduced the number of
correlations to be investigated by medical researchers, but it does not guarantee that each rule is
of value. That knowledge can only be obtained from medical professionals. Our goal is to
provide a catalyst for their research and a "jumping off point" for more in-depth clinical
investigation. If that mindset is maintained, the genetic algorithm is proving most helpful.

Verification is also a key issue. Rules and their associated fitness measures generated by
a genetic algorithm will be true; that has been easily verified by conventional query. Ensuring
that the rules generated are the best ones to describe the analysis database is more challenging.
We have two different methods for responding to this challenge: duplicability and
reproducibility.

The  database  of  19,000  records  has  been  split  into  several  sample  sets.  Each  sample  set  is 
selected  randomly  without  replacement.  We  actually  use  two  database  subsets  of  around  7,700 
records  each.  The  genetic  algorithm  is  applied  to  one  sample  subset  and  its  output  rules  are  then 
applied  to  the  second  subset.  If  the  fitness  measure  for  a  rule  is  uniform  throughout  the  two 
independent,  randomly-selected  databases,  then  there  is  confidence  that  this  rule  holds  for  the 
entire  database  and  is  not  a  statistical  anomaly.  We  call  this  attribute  duplicability. 
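
The duplicability check can be sketched in Python as a random split followed by a comparison of the two fitness values. The text does not state how close the two values must be, so the 10% relative tolerance below is purely an assumption, as are the function names.

```python
import random


def split_samples(record_ids, sample_size, rng):
    """Draw two disjoint random samples (selection without replacement)."""
    drawn = rng.sample(record_ids, 2 * sample_size)
    return drawn[:sample_size], drawn[sample_size:]


def is_duplicable(fitness_a, fitness_b, tolerance=0.10):
    """True when a rule's fitness is roughly uniform across both samples."""
    scale = max(abs(fitness_a), abs(fitness_b))
    if scale == 0:
        return True
    return abs(fitness_a - fitness_b) / scale <= tolerance
```

For example, `split_samples(list(range(19000)), 7700, random.Random(0))` returns two non-overlapping sets of 7,700 record identifiers; the genetic algorithm would be run on the first and its output rules re-scored on the second.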

The second verification procedure is reproducibility. It cannot be proven that a genetic
algorithm has actually found the best rules for a given search space. The only way to accomplish
this is to actually check every possible combination, which we have already stated is physically
impractical. How then may we have any certainty that the technique has worked; that the
algorithm has used a sufficiently large population over a sufficiently large number of generations
to achieve an acceptable answer? Since a genetic algorithm depends on the simulation of survival
of the fittest (Darwinism) based solely on probability modeling and random number generation,
it will never analyze the same problem the same way twice. We run every problem twice and
note the number of rules that occur in both outcome rule sets. If both independent discovery
sessions produce a high number of rule intersections, then this indicates that the state space
has been searched comprehensively (see Figures #14 and #15). If this is not the case, then the
population size and/or number of generations must be increased for an effective discovery
session.

[Figure 14 diagram. Legend: run #1, run #2, run #3; one marker denotes a hypothesis discovered
by all three runs and another a hypothesis not discovered by all three runs (larger markers
indicate larger fitness measures). Annotation: a large number of the highest fitness rules are
discovered by all three runs. This suggests a comprehensive search of the alternative space.]

Figure  14.  Strong  Reproducibility  in  GA  Search 


[Figure 15 diagram. Legend: run #1, run #2, run #3; one marker denotes a hypothesis discovered
by all three runs and another a hypothesis not discovered by all three runs (larger markers
indicate larger fitness measures). Annotation: little or no intersection between hypotheses
discovered by independent runs. Suggests the search space has not been effectively searched.]

Figure  15.  Weak  Reproducibility  of  GA  Search 
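
The comparison illustrated in Figures 14 and 15 can be quantified as the share of top-ranked rules common to every run. The Python sketch below is one way to do so; the function names are hypothetical, each run is assumed to be a mapping from rule to fitness, and no particular threshold for "strong" reproducibility is implied.

```python
def top_rules(fitness_by_rule, n):
    """The n rules with the highest fitness in one discovery session."""
    ranked = sorted(fitness_by_rule, key=fitness_by_rule.get, reverse=True)
    return set(ranked[:n])


def shared_fraction(runs, n):
    """Fraction of the top-n rules that were discovered by every run."""
    if not runs or n < 1:
        raise ValueError("need at least one run and n >= 1")
    common = set.intersection(*(top_rules(run, n) for run in runs))
    return len(common) / n
```

A value near 1 corresponds to the situation in Figure 14; a value near 0 corresponds to Figure 15 and suggests that the population size or the number of generations should be increased.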

Finally, a great deal of emphasis is placed on the discovery of rules which are intuitively
obvious to medical professionals. This may appear insignificant at first, but as mentioned before,
genetic algorithms are unguided random processes possessing no knowledge of medical facts. If,
through their learning process, they produce a series of rules that mimic accepted medical
knowledge, then this lends confidence that accompanying rules, which do not make intuitive
sense, may contain new and significant information.

VI.  RESULTS


A.       SUMMARY 

DaMI  has  achieved  striking  successes  throughout  our  experiments.  The  theoretical  basis 
for  the  design  of  this  search  algorithm  is  sound  and  has  allowed  this  system  to  perform  and 
produce  results.  DaMI  is  a  very  exciting  application  because  its  performance  matches  or  exceeds 
theoretical  expectations,  and  it  identifies  previously  undiscovered  correlations  in  the  CCEP 
Desert  Storm  Database.  In  this  chapter,  we  will  characterize  the  initial  success  of  DaMI  by 
presenting  a  series  of  experimental  results  which  build  on  the  framework  developed  by  this 
thesis. Success in this research is measured by responding to the following questions:

•  Did the Genetic Algorithm (DaMI) perform as theoretically predicted?

•  What correlations did the Genetic Algorithm actually find in the CCEP database, and
were these hypotheses, at least from a statistical perspective, consistent with the
research goals?

•  How useful were the hypotheses discovered to CCEP medical researchers?


Each will be examined individually in the following sections of this chapter, building up to a
comprehensive evaluation of DaMI's theoretical as well as practical performance.

Twenty-five  discovery  sessions  (runs)  have  been  conducted  by  DaMI  thus  far,  of  which 
six  production  runs  are  discussed  in  the  results  section.  Earlier  runs  were  used  to  test  the 
performance  of  DaMI  during  development  and  refine  the  settings  of  tunable  parameters  for 
optimal  discovery.  Genetic  algorithm  development  is  a  constant  process  of  discovery,  feedback 
and  refinement.  The  runs  conducted  to  date  are  by  no  means  all-inclusive,  but  rather  chronicle  a 
successful  venture  into  the  CCEP  database. 

DaMI  has  been  directed  to  analyze  two  different  perspectives  of  the  CCEP  database 
(three  identical  production  runs  for  each  perspective).  The  first  runs  search  for  associations 
between  the  gender,  service,  race,  and  reported  exposures  of  PGW  participants  (LHS)  and  the 
diagnoses that were assigned by the CCEP medical examination process (RHS). We refer to
these runs as exposure-to-diagnosis runs. The second set of runs searches for associations between
gender, service, race, and reported exposures of PGW participants (LHS) and the standard
symptoms that were elicited during the CCEP medical examinations (RHS). We refer to these
runs as exposure-to-symptom runs. The reader is referred to Appendix A for a detailed list of
fields included in each analysis. Each production run utilized a population size of 1000, a
crossover probability of 30%, and a mutation probability of 3.0% (see section V.C for a discussion of
tunable parameters). Modified j-measure has been used as a fitness measure, and only the single
best j-measure of all combinations of each individual attribute set was used for aggregate fitness
by  the  statistical  analysis  package  (see  section  V.B).  Hypotheses  generated  were  limited  to 
combinations  of  up  to  three  LHS  attributes  and  two  RHS  attributes.  Production  runs  have 
simulated  at  least  130  generations;  some  were  allowed  to  continue  for  170  generations. 
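
The production-run settings listed above can be gathered into a single configuration record. The following is a minimal sketch only: the class name, field names, and the validation step are illustrative assumptions and are not taken from the DaMI source code; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings reported for a production run (names are hypothetical)."""
    population_size: int = 1000
    crossover_probability: float = 0.30
    mutation_probability: float = 0.03
    max_lhs_attributes: int = 3   # exposure/demographic attributes per hypothesis
    max_rhs_attributes: int = 2   # diagnosis or symptom attributes per hypothesis
    min_generations: int = 130
    max_generations: int = 170

    def validate(self) -> None:
        """Reject settings that cannot describe a meaningful run."""
        for name in ("crossover_probability", "mutation_probability"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        if self.min_generations > self.max_generations:
            raise ValueError("min_generations exceeds max_generations")


config = RunConfig()
config.validate()
```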


B.       DID  THE  GENETIC  ALGORITHM  PERFORM  AS 
EXPECTED? 


As  theoretically  predicted,  DaMI  performs  very  well,  in  terms  of  speed,  hypothesis 
quality  improvement,  and  search  space  coverage.  This  question  focuses  solely  on  the  ability  of 
DaMI  to  perform  an  efficient,  self-improving  search  and  not  on  the  value  of  results  to  medical 
professionals  (which  will  be  discussed  in  the  next  section).  The  tremendous  size  of  the  search 
space  has  been  mentioned  earlier,  but  the  number  of  possible  combinations  should  be  presented 
specifically  at  this  point: 

•     Exposure-to-diagnosis  Runs.  29  Boolean  reported  exposures,  gender  (2  possible 
values),  service  (6  values),  race  (8  values),  and  21  Boolean  diagnoses. 


Possible combinations = $2^{29} \times 2 \times 6 \times 7 \times 2^{21} = 9.46 \times 10^{16}$

•     Exposure-to-symptom Runs. 29 Boolean reported exposures, gender (2 possible
values), service (6 values), race (8 values), and 21 Boolean symptoms.

Possible combinations = $2^{29} \times 2 \times 6 \times 7 \times 2^{15} = 1.48 \times 10^{15}$

It  is  clear  that  these  two  types  of  runs  present  a  credible  challenge  to  any  genetic  algorithm. 
They  are  both  computationally  explosive  (because  of  search  space  size)  and  highly  unstructured 
(because  of  the  high  number  of  LHS  and  especially  RHS  attributes),  yet  DaMI  has  processed 
them  with  striking  success. 
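
The two search-space sizes can be checked with a few lines of arithmetic. This sketch uses the factors exactly as they appear in the two formulas (29 Boolean exposures, 2 x 6 x 7 demographic values, and 21 or 15 Boolean outcomes); the function and variable names are illustrative.

```python
def search_space_size(n_exposures: int, gender: int, service: int,
                      race: int, n_outcomes: int) -> int:
    """Number of value combinations: each Boolean attribute doubles the space."""
    return 2 ** n_exposures * gender * service * race * 2 ** n_outcomes


# Factors taken from the two formulas in the text.
diagnosis_space = search_space_size(29, 2, 6, 7, 21)
symptom_space = search_space_size(29, 2, 6, 7, 15)

print(f"exposure-to-diagnosis: {diagnosis_space:.2e}")  # 9.46e+16
print(f"exposure-to-symptom:   {symptom_space:.2e}")    # 1.48e+15
```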

1.         Analysis  Speed 

DaMI's search efficiency allows it to perform analyses, which normally take years, in a
matter  of  hours.  Analysis  speed  is  the  time  required  for  a  genetic  algorithm  to  comprehensively 
search  the  given  space.  Comprehensive  search  will  be  dealt  with  shortly,  but  at  the  moment,  we 
will focus on the time required for DaMI to complete an analysis. If that time is significantly
less  than  would  be  possible  using  a  "brute  force"  examination  of  the  same  database,  then  the  first 
advantage  has  been  achieved.  As  mentioned  in  section  II,  it  was  observed  that  a  personal 
computer  can  test  about  600,000  possible  combinations  per  day.  If  that  is  the  case,  then  the 
exposure-to-diagnosis run should take about 432 million years, which is clearly not acceptable.
Since  DaMI  never  searches  a  space  the  same  way  twice,  analysis  times  for  the  same  problem 
vary;  however,  DaMI  performs  the  same  analysis  in  36  hours  (on  average).  Exposure-to- 
symptom  runs  take  about  44  hours,  using  the  genetic  algorithm.  Although  the  exposure-to- 
symptom  runs  involve  a  smaller  search  space,  DaMI  requires  more  generations  to  converge  on  an 
answer.  Analysis  times  do  increase  in  relation  to  the  number  of  possible  combinations;  however, 
the  character  of  the  research  question  also  affects  the  time  required  for  DaMI  to  converge  on  an 
answer.  Analysis  times  of  similar  runs  are  fairly  consistent  (less  than  10%  deviation).  A  profile 
of  the  three  DaMI  exposure-to-diagnosis  runs  is  illustrated  in  Figure  #16. 

Figure 16. Analysis Speed Profile of Exposure-to-diagnosis Runs

Notice  that  the  processing  speed  increases  as  a  small  group  of  rules  begin  to  dominate  the 
population  (convergence).  It  must  be  reiterated  that  DaMI  uses  the  same  platform  as  was  used 
for "brute force" testing; it is the selectivity of search (knowing what alternatives need not be
tested)  that  gives  this  methodology  its  incredible  advantage. 
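
The brute-force comparison can be worked through as a back-of-the-envelope estimate. The sketch below assumes the rate quoted in this section (600,000 combinations per day), a 365-day year, and the 36-hour average run time; all names are illustrative. At that rate the exhaustive search of the exposure-to-diagnosis space works out to roughly 4.3 x 10^8 years.

```python
COMBINATIONS_PER_DAY = 600_000       # rate quoted in the text
DIAGNOSIS_SPACE = 2 ** 29 * 2 * 6 * 7 * 2 ** 21
GA_RUN_HOURS = 36                    # average exposure-to-diagnosis run


def brute_force_years(space: int, per_day: int = COMBINATIONS_PER_DAY,
                      days_per_year: int = 365) -> float:
    """Years needed to test every combination once at a fixed daily rate."""
    return space / per_day / days_per_year


years = brute_force_years(DIAGNOSIS_SPACE)
# Ratio of exhaustive-search time to genetic-algorithm run time, in hours.
speedup = years * 365 * 24 / GA_RUN_HOURS

print(f"brute force: {years:.3g} years")
print(f"speed-up over a {GA_RUN_HOURS}-hour run: {speedup:.3g}x")
```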


2.         Hypothesis  Quality  Improvement 

DaMI  is  consistently  able  to  adaptively  improve  the  quality  of  the  hypotheses  it 
generates  as  the  analysis  progresses.  A  genetic  algorithm  is  theoretically  an  intelligent,  adaptive 
search  technique.  This  means  that  as  processing  time  passes,  the  system  will  generate 
hypotheses of increasing quality based on the results of analyses already conducted. In the case
of DaMI, quality is indicated by the fitness measure of a hypothesis. The cumulative
fitness  of  a  generation  represents  the  aggregate  quality  of  all  the  hypotheses  synthesized  during 
that  generation.  Although  some  new  individuals  in  each  generation  may  receive  very  low  fitness 
measures, if the cumulative fitness increases in successive generations, then the quality of
hypotheses as a whole is improving. DaMI demonstrates the characteristic ability of genetic
algorithms  to  rapidly  increase  the  quality  of  new  hypotheses  generated.  DaMI  rapidly  improves 
cumulative  fitness  until  a  small  group  of  rules  begins  to  dominate  the  population  [premature 
convergence  (Koza,  1989)],  but  (largely  because  of  same-parent  crossover  randomization)  it  then 
boosts  mutation  probability  and  continues  to  break  through  to  higher  cumulative  fitness  plateaus. 
A  profile  of  improving  hypothesis  quality  for  exposure-to-diagnosis  runs  is  presented  in  Figure 
#17. Note that in each of the three runs, the cumulative fitness curve levels off (signaling premature
convergence)  and  then  continues  to  sporadically  increase. 
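
The pattern described above (cumulative fitness rises, levels off, and a higher mutation probability helps the search break through) can be sketched as follows. This is an illustration under stated assumptions, not DaMI's actual mechanism: the plateau window, the tolerance, and the boosted probability of 0.10 are hypothetical values chosen for the example; only the 3% base mutation probability comes from the text.

```python
def cumulative_fitness(population_fitness):
    """Aggregate quality of one generation: the sum of individual fitnesses."""
    return sum(population_fitness)


def has_plateaued(history, window=5, tolerance=0.01):
    """True if cumulative fitness grew by less than `tolerance` (relative)
    over the last `window` generations. Both parameters are assumptions."""
    if len(history) <= window:
        return False
    old, new = history[-window - 1], history[-1]
    if old == 0:
        return new == 0
    return (new - old) / abs(old) < tolerance


def next_mutation_probability(history, base=0.03, boosted=0.10):
    """Use the base rate while fitness improves; boost it on a plateau."""
    return boosted if has_plateaued(history) else base


improving = [100, 120, 150, 190, 240, 300, 370]
stalled = [100, 200, 300, 300, 300, 300, 300, 300]
print(next_mutation_probability(improving))  # base rate
print(next_mutation_probability(stalled))    # boosted rate
```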

Figure 17. Analysis Speed Profile of Exposure-to-diagnosis Runs

3.  Reproducibility:  Search  Space  Coverage 

While  a  genetic  algorithm  may  complete  a  search  quickly,  the  speed  advantage  is  of 
limited  value  without  some  indication  that  the  results  derived  are  actually  the  best  in  the  search 
space. DaMI produces consistent reproducibility on the extremely large spaces it searches,
attesting  to  its  strong  ability  to  search  a  large  space  by  testing  a  small  subset  of  possible 
combinations.  As  discussed  in  section  V.D.3,  proving  that  a  genetic  algorithm  has  completely 
examined a space is a paradoxical question: you cannot prove that the genetic algorithm made
the right decision without testing every possible hypothesis. Reproducibility gives a strong
indication that the alternative space has been searched effectively. Ideally, multiple
independent runs of the genetic algorithm (see section V.D.3) would test only a few
of the same rules of low fitness but converge on the same rules of high fitness. A low
intersection  of  low  fitness  rules  between  runs  indicates  that  each  approached  convergence  from 
different  areas  of  the  search  space  (i.e.  they  did  not  all  follow  the  same  path).  A  high  intersection 
of  high  fitness  rules  suggests  that,  despite  entering  the  search  space  from  different  directions, 
each  independent  run  has  arrived  at  the  same  answer.  This  reproducibility  strongly  suggests  that 
the  entire  search  space  has  been  effectively,  but  not  physically,  examined. 
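
The reproducibility check described here amounts to comparing, per fitness band, which rules each independent run tested. The sketch below assumes each run is recorded as a mapping from rule to fitness and defines the intersection percentage as the share of rules in the band (found by any run) that were found by every run; that definition, the function name, and the toy data are assumptions for illustration.

```python
def intersection_percentage(runs, low, high):
    """Percentage of rules with fitness in [low, high) that all runs share.

    `runs` is a list of dicts mapping a rule identifier to its fitness.
    """
    in_band = [
        {rule for rule, fitness in run.items() if low <= fitness < high}
        for run in runs
    ]
    union = set().union(*in_band)
    if not union:
        return 0.0
    common = set.intersection(*in_band)
    return 100.0 * len(common) / len(union)


# Toy data: three runs agree on the strong rules but explore
# different weak rules on the way there.
run_a = {"r1": 9.0, "r2": 8.5, "r3": 2.0, "r4": 1.5}
run_b = {"r1": 9.0, "r2": 8.5, "r3": 2.0, "r5": 2.2}
run_c = {"r1": 9.0, "r2": 8.5, "r6": 1.1, "r7": 2.9}
runs = [run_a, run_b, run_c]

high_band = intersection_percentage(runs, 8.0, float("inf"))
low_band = intersection_percentage(runs, 1.0, 3.0)
print(high_band, low_band)
```

A high value in the top band together with a low value in the bottom band is the signature the text describes: different paths through the space, same destination.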

DaMI  achieves  high  reproducibility  in  spite  of  the  rapid  search  time  and  tremendous 
space.  In  the  exposure-to-diagnosis  study,  all  three  runs  agree  on  the  same  16  highest  fitness 
hypotheses.  Lower  fitness  hypotheses  show  steadily  decreasing  levels  of  intersection,  as  is 
theoretically  predicted.  This  is  particularly  exciting,  because  each  production  run  has  achieved 
consensus  by  testing  only  7,100  -  7,400  of  the  1,041,000  possible  attribute  combinations.  The 
probability  of  three  independent  runs  randomly  agreeing  on  the  same  sixteen  hypotheses 
(especially since each run is testing only 0.7% of all possible attribute combinations) is
infinitesimally small. The natural question is, "Did the three runs, by some streak of luck, enter
the search space from the same starting point?" This is not the case, because the three runs only
tested 14.1% of the same lower fitness rules, indicating that they have entered the space from
different  points  but  converged  on  the  same  answer.  Note  in  Figure  #18  that  the  percentage  of 
rule  intersection  (Runs  20,  21,  and  22  are  the  three  runs  conducted  in  the  exposure-to-diagnosis 
study)  between  runs  approaches  100%  for  rules  with  a  fitness  measure  higher  than  8.0.  This 
intersection  decreases  steadily  as  the  fitness  measure  decreases  (going  left  on  the  graph).  In  the 
case  of  exposure-to-symptoms,  the  reproducibility  is  not  as  high,  but  still  quite  striking.  In  this 
study,  each  run  tested  between  8,000  and  10,000  hypotheses.    The  three  runs  agree  on  5  of  6 
highest  fitness  hypotheses.  This  is  represented  in  Figure  #19  by  an  intersection  percentage  of 
80% on hypotheses with a fitness of over 5.31 (Runs 23, 24, and 25 are the three runs conducted
in  the  exposure-to-symptom  study).  Notice  that,  as  in  the  exposure-to-diagnosis  study,  the 
intersection  between  runs  decreases  as  the  fitness  measure  decreases,  culminating  with  an 
intersection  of  only  20%  for  rules  with  fitness  measures  between  1.0  and  3.0. 

Figure 18. Exposure-to-diagnosis Reproducibility

Figure 19. Exposure-to-symptom Reproducibility

Based on the high reproducibility of DaMI production runs, there is a strong indication
that  the  search  space  has  been  effectively  searched  for  the  given  fitness  measure  and  search 
parameters.  This  is  particularly  significant  in  the  case  of  Desert  Storm  research.  Recall  that  the 
existence  of  any  syndrome  has  not  yet  been  determined.  Therefore,  if  DaMI  fails  to  find  a  viable 
syndrome  profile  but  can  show  that  the  space  has  been  searched  effectively,  that  information  will 
be  of  extremely  high  value  to  CCEP  research.  Additionally,  any  comprehensive  list  of 
correlations  between  risk  factors  and  medical  outcomes  will  be  of  value  to  PGW  participants  and 
the  medical  practitioners  providing  their  ongoing  medical  care. 

C.       WHAT  DID  DaMI  FIND? 

DaMI  has  proven,  by  the  standards  of  genetic  algorithm  theory,  that  it  has  studied  the 
CCEP  database  quickly,  intelligently,  and  comprehensively.  All  of  the  theory  and  development 
strategies  now  come  down  to  one  question,  "What  did  we  learn?"  Computational  results  so  far 
suggest  that  our  system  has  succeeded  at  the  given  tasks,  requiring  relatively  few  resources. 
Experiments  reveal  no  single  syndrome,  but  numerous  correlations  do  exist  that  require 
additional  clinical  analysis. 

Based  on  DaMI  research,  there  is  no  indication  that  a  single  syndrome  or  other  medical 
entity  is  causing  wide-spread  adverse  health  ramifications  among  a  significant  cross-section  of 
PGW participants in the CCEP program. By "significant," we mean that no group of over 100
participants sharing common reported exposure/demographic information exhibits a unique set
of reported symptoms and/or outcome diagnoses. Keep in mind that only the 21 most frequently
reported  diagnoses  (and  combinations  of  these)  have  been  tested  to  date.  This  does  not  mean  that 
a  syndrome  cannot  exist,  but  the  data  collected  by  CCEP  and  specifically  studied  by  this  research 
does  not  indicate  such  a  correlation. 

There  are,  however,  numerous  correlations  of  exposure/demographic  information  and 
associated  symptoms/diagnoses  which  suggest  that  smaller  groups  may  share  common  health 
conditions  based  on  shared  exposure  to  common  health  risk  factors.  These  associations  are  based 
solely  on  statistical  correlation;  therefore,  a  final  determination  is  withheld  pending  review  of  the 
information by medical professionals. In any case, the examined data suggests a need for further
research. 

The  number  of  correlations  found  by  DaMI  is  quite  large;  we  have  resisted  summarizing 
hypotheses  to  preserve  the  robustness  of  the  information.  Therefore,  the  challenge  of  filtering 
and  reporting  awaits  the  input  of  CCEP  researchers.  Each  exposure-to-diagnosis  run  has 
produced  around  4,500  hypotheses,  and  each  exposure-to-symptom  run  has  produced  about  6,100 
hypotheses.  In  each  case,  the  three  sets  of  rules  are  combined  into  a  single  hypothesis  set  (with 
duplicates  removed).  The  information  has  been  further  refined,  subject  to  the  following  criteria: 

•  Hypotheses  applying  to  fewer  than  five  individuals  in  the  sample  set  have  been 
removed  to  prevent  undue  influence  by  single  outliers.  By  definition,  a  syndrome  is 
a  medical  condition  shared  by  a  number  of  individuals. 

•  Hypotheses are derived from a randomly selected 45% sample (drawn without replacement)
of the entire CCEP database. These hypotheses are tested against a separate
45%  (independent)  partition  of  the  CCEP  database.  Hypotheses  whose  fitness 
measure  in  the  second  (verification)  sample  differed  from  the  fitness  measure  from 
the  original  sample  by  more  than  20%  have  been  eliminated.  Fitness  measures 
which  remain  constant  over  both  the  original  and  verification  sample  are  called 
duplicable,  suggesting  they  hold  true  for  the  entire  database  and  are  not  a  statistical 
anomaly. 
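
The two refinement criteria above can be sketched as a small filter. This is a minimal illustration, assuming the 20% difference is measured relative to the fitness in the original (analysis) sample; the class, function names, rule labels, and support counts are hypothetical, while the fitness pairs are the analysis/verification values quoted later in this chapter.

```python
import random
from dataclasses import dataclass


@dataclass
class Hypothesis:
    rule: str
    support: int                 # individuals the rule applies to
    analysis_fitness: float      # fitness in the original 45% sample
    verification_fitness: float  # fitness in the independent 45% sample


def split_partitions(record_ids, fraction=0.45, seed=0):
    """Draw two disjoint random samples, each `fraction` of the records."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    n = int(len(ids) * fraction)
    return ids[:n], ids[n:2 * n]


def is_duplicable(h, max_relative_change=0.20):
    """True if verification fitness is within 20% of analysis fitness."""
    if h.analysis_fitness == 0:
        return h.verification_fitness == 0
    change = abs(h.verification_fitness - h.analysis_fitness)
    return change / abs(h.analysis_fitness) <= max_relative_change


def refine(hypotheses, min_support=5):
    """Drop rules covering fewer than five people or failing verification."""
    return [h for h in hypotheses
            if h.support >= min_support and is_duplicable(h)]


candidates = [
    Hypothesis("example A", support=40, analysis_fitness=2.39, verification_fitness=2.10),
    Hypothesis("example B", support=3, analysis_fitness=2.85, verification_fitness=2.43),
    Hypothesis("example C", support=25, analysis_fitness=2.77, verification_fitness=1.50),
]
kept = refine(candidates)
analysis_ids, verification_ids = split_partitions(range(1000))
```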

The  application  of  the  aforementioned  selection  criteria  has  resulted  in  a  set  of  2,653  candidate 
hypotheses  concerning  exposure-to-diagnoses  and  4,959  hypotheses  concerning  exposure-to- 
symptoms.  No  minimum  fitness  measure  threshold  has  been  applied  because  the  modified  j- 
measure  is  an  arbitrary  score,  suitable  for  ranking  the  order  of  interest  of  competing  hypotheses. 
The fitness measure may not be attached to a specific interest "level." Obviously, a great number
of the hypotheses having low fitness measures do not contain correlations strong enough to
warrant serious research attention. For this reason and for the sake of brevity, only the 100
highest  fitness  hypotheses  of  each  study  are  included  in  Appendix  C  and  discussed  in  the  next 
two  result  summary  sections. 

These two sections will discuss the highlights and some specific hypotheses from both
the exposure-to-diagnosis and exposure-to-symptom studies. The exposure-to-diagnosis and
exposure-to-symptom results are each exciting for different reasons. The exposure-to-diagnosis
study contains many high confidence correlations, that is, hypotheses which are applicable to over 50%
of the participants concerned. The exposure-to-diagnosis hypotheses contain few unexpected
correlations,  but  clearly  demonstrate  the  ability  of  DaMI  to  cull  out  extremely  strong  correlations 
from  a  "mountain"  of  data.  The  exposure-to-symptom  results  contain  many  unexpected 
hypotheses,  but  with  somewhat  lower  correlation  strength.  The  exposure-to-symptom  results 
attest  to  the  sensitivity  of  DaMI  analysis  and  contain  new  (previously  undiscovered)  information 
which  should  attract  expanded  clinical  research. 

1.         Exposure-to-diagnosis  Correlations 

The  exposure-to-diagnosis  study  yields  a  large  number  of  strong  correlations  (positive 
predictive values between exposure and diagnosis of over 50%) and provides corroboration of
some  intuitive  aspects  of  medical  relationships.  Several  new  relationships  have  been  identified, 
but  few  hold  information  that  is  unexpected  by  the  non-medical  analyst,  at  least  when  studied 
separately  from  associated  symptoms.  DaMI  demonstrates  a  powerful  ability  to  cull  strong 
correlations  from  a  large  body  of  data,  and  in  that  respect,  the  results  are  very  exciting.  It  must 
be  reiterated  that  only  combinations  of  the  21  most  frequently  occurring  diagnoses  have  been 
considered  at  this  point.  However,  a  restructuring  of  the  CCEP  diagnosis  representation  which 
groups  like  diagnoses  (with  differing  ICD  codes)  may  bear  even  more  information. 

No single exposure or group of exposures appears to dominate the resulting hypothesis
set, unlike what will be seen in the exposure-to-symptom study. Several exposures (but no
demographic attributes) appeared in many of the 100 highest fitness hypotheses. 19% of the
hypotheses  included  participants  who  were  wounded  and  another  19%  included  participants  who 
saw  casualties.  Yet  another  19%  of  hypotheses  included  participants  who  reported  exposure  to 
"other  paints"  and  12%  reported  exposures  to  nerve  gas.  At  first,  the  fact  that  many  hypotheses 
include  wounded  participants  appears  interesting  because  only  1%  of  participants  in  the  CCEP 
database have been wounded. Also, only 4% of CCEP participants report exposure to nerve gas,
so that too seems to be highly represented in the hypotheses. Casualties and other paints in
hypotheses  are  less  surprising  since  both  have  been  highly  reported  by  CCEP  participants  (50% 
and  38%  respectively).  However,  37%  of  the  hypotheses  discovered  include  Post-traumatic 
Stress Disorder and 22% include Depression (CCEP, 1996, p. 19). This high prevalence of
psychosocial diagnoses in the hypothesis set makes it less surprising that many hypotheses
concern wounded participants (as the two are commonly associated). Surprisingly, Severe Sleep
Apnea  is  included  in  20%  of  the  hypotheses.  Sleep  Apnea  is  a  medical  condition  not  commonly 
linked to any CCEP reported exposure. This leaves the prevalence of reported Nerve Gas
exposures and the diagnosis of Sleep Apnea in hypotheses as the only unexpected attributes,
from a macro perspective. Reported nerve gas exposure is all the more unexpected because
chemical alarms and mustard gas (similar participant concerns) are notably scarce in the
hypotheses. It will be seen later that reported nerve gas exposure plays a significant role in the
exposure-to-symptom study. Finally, it should be noted that oil and smoke, heat and smoke,
Pyridostigmine Bromide (Pb), and headaches are included in few hypotheses, although all are factors
receiving high attention in CCEP research.

An  explanation  of  the  DaMI  reporting  format  is  included  in  Figure  #20.  While  the  space 
is  not  available  to  discuss  even  the  100  highest  fitness  hypotheses,  several  illustrative  hypotheses 
are presented now in Figure #21. Especially in the exposure-to-diagnosis study, DaMI
demonstrates the ability to unmask high levels of association between exposure/demographic and
diagnosis attributes. DaMI is not limited to finding high positive predictive value (high
probability of the then condition given the if condition); it is also able to look at the associations in
reverse (high probability of the if condition given the then condition) and examine the
contraindications (the if condition precludes the then condition) between exposures/demographics
and diagnoses. An example of each association type is presented below. The medical
professional  is  referred  to  Appendix  C  for  a  complete  list  of  hypotheses. 

Figure 20. How to Read a DaMI Report

Figure 21. Exposure-to-diagnosis Examples

As stated before, the exposure-to-diagnosis examples presented here demonstrate the
capability  of  DaMI  to  dig  into  a  "mountain"  of  data  and  find  strong  hypotheses.  The  examples 
selected for presentation here have been chosen to illustrate that capability. It is highly recommended
that the medical professional examine all of the hypotheses (Appendix C) in detail. Figure
#21(a) is a hypothesis of extremely high positive predictive value. The hypothesis states that
94% of participants diagnosed with mechanical lower back pain and major depression served in
the Army. 94% is an extremely high correlation for such a broad hypothesis (a specific diagnosis
combination is linked to a single service). Note that the fitness measure obtained using the
analysis database (complex association factor) is quite close (2.39/2.10) to that of the verification
database (complex association verification), suggesting that the rule holds for all participants (not
a statistical anomaly). The hypothesis illustrated in Figure #21(b) is much more specific, but is
still quite strong. This hypothesis states that 77% of the participants diagnosed with
DJD/Osteoarthritis and Severe Sleep Apnea reported eating Non-allied Forces
food and reported exposure to pesticides. DaMI is capable of isolating strong data correlations,
regardless of hypothesis specificity.

Figure 22. Exposure-to-diagnosis Examples

The  next  two  hypotheses  are  equally  interesting,  but  are  much  more  difficult  to  find 
using conventional search techniques. DaMI, using the Modified J-measure, is able to see
correlations  which  do  not  fit  the  high  positive  predictive  value  paradigm.  The  hypothesis  in 
Figure  #22(a)  states  that  18%  of  Marine  participants  reporting  exposure  to  pesticides  and  malaria 
have  been  diagnosed  with  asthma.  A  positive  predictive  value  of  18%  does  not  jump  out  at  the 
analyst  and  would  therefore  not  figure  prominently  in  a  conventional  analysis;  however,  DaMI 
notes that only 5.1% of all participants have been diagnosed with Asthma. This means that
Marines  reporting  pesticide  and  malaria  exposure  are  3.5  times  more  likely  to  have  been 
diagnosed  with  Asthma  than  the  general  CCEP  participant  population.  In  light  of  that  fact,  the 
18%  positive  predictive  value  of  this  hypothesis  is  indeed  significant,  and  DaMI  has  assigned  it  a 
high  fitness  measure.  The  hypothesis  in  Figure  #22(b)  is  an  example  of  contraindication.  Note 
that  this  hypothesis  shows  no  high  correlation  in  either  direction.  The  hypothesis  states  that  2% 
of  participants  reporting  no  exposure  to  Pb  and  not  viewing  casualties  have  been  diagnosed  with 
Post-traumatic  Stress  Disorder  (PTSD).  The  reader's  attention  is  directed  to  the  matrix  on  the 
right section of the hypothesis report. In 589 cases where the LHS is true, the RHS is false.
Also, in 424 cases where the RHS is true, the LHS is false. In all, 1,022 participants report information
that in some way involves this hypothesis' exposures or diagnosis. In 99% of those cases, the
exposures exclude the diagnosis outcome. In plain English, not reporting exposure to Pb or
casualties all but precludes a diagnosis of PTSD. This fact, although readily apparent to conventional
analysis, is very informative because of its exclusive properties and is therefore flagged by
DaMI.
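
The two figures quoted in this paragraph (the 3.5-fold asthma ratio and the 99% exclusion rate) follow from simple ratios. The sketch below is illustrative and is not the Modified J-measure itself; the function names are assumptions, and the count of 9 cases in which both sides of the PTSD rule hold is inferred as 1,022 - 589 - 424 rather than quoted from the report.

```python
def lift(positive_predictive_value: float, base_rate: float) -> float:
    """How many times more likely the outcome is in the subgroup
    than in the whole population."""
    return positive_predictive_value / base_rate


def exclusion_rate(lhs_only: int, rhs_only: int, both: int) -> float:
    """Share of involved participants for whom exactly one side of the
    rule holds, i.e. the exposures and the diagnosis exclude each other."""
    involved = lhs_only + rhs_only + both
    return (lhs_only + rhs_only) / involved


# Asthma example: 18% of the subgroup versus 5.1% of all participants.
asthma_lift = lift(0.18, 0.051)

# PTSD example: 589 LHS-only cases, 424 RHS-only cases, 1,022 involved.
both_true = 1022 - 589 - 424
ptsd_exclusion = exclusion_rate(589, 424, both_true)
ptsd_ppv = both_true / (589 + both_true)

print(f"asthma lift: {asthma_lift:.1f}")
print(f"PTSD exclusion rate: {ptsd_exclusion:.0%}, PPV: {ptsd_ppv:.0%}")
```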

The  exposure-to-diagnosis  study  hypotheses  exemplify  the  ability  of  our  genetic 
algorithm  to  find  both  strong,  obvious  correlations  and  more  intricate  associations  in  the  CCEP 
database.  Many  of  the  hypotheses  reinforce  "common  sense"  medical  knowledge,  but  remember 
that  DaMI  has  discovered  these  hypotheses  without  the  benefit  of  prior  medical  knowledge  of 
any  kind.  In  light  of  this  success,  serious  attention  should  be  directed  toward  those  hypotheses 
presented  that  do  not  conform  to  present-day  medical  perceptions. 

2.  Exposure-to-symptom  Correlations 

The  exposure-to-symptom  study  is  more  comprehensive  than  the  diagnosis  studies 
because  the  exposure-to-symptom  runs  consider  every  reported  symptom  category,  not  a  top 
stratification. Many individual hypotheses contain new (or unexpected) correlations and there
are also several interesting trends revealed about the hypotheses as a group. This previously
undiscovered  information  is  of  key  interest  to  medical  researchers.  The  author  believes  that  this 
is  the  reason  that  exposure-to-symptom  runs  consistently  take  longer  to  converge  and  are 
somewhat  less  successful  at  reproducing  than  exposure-to-diagnosis  runs.  Even  though  the 
theoretical  search  space  of  exposure-to-symptom  runs  is  smaller,  the  actual  search  space  contains 
more  represented  combinations  (because  all  attributes  are  included)  and  is  therefore  practically 
more  difficult  to  solve.  This  explains  the  difference  in  run  times  for  different  studies  noted 
previously. 

While  the  exposure-to-diagnosis  runs  contain  several  intuitively  obvious  correlations,  the 
exposure-to-symptom  runs  produce  several  strong  but  "unexpected"  trends.  These  unexpected 
trends  take  the  form  of  pervasive  exposure  and  symptom  combinations  appearing  in  many  of  the 
highest fitness hypotheses, despite the fact that these combinations are not prevalent in the CCEP
database as a whole. These are the specific "threads" of information that DaMI has been
designed  to  discover. 

Several  exposure  attributes  appear  many  times  in  the  highest  fitness  exposure-to- 
symptom  hypotheses: 

•  over  50%  of  the  hypotheses  include  reported  exposure  to  mustard  gas  (singly  or  in 
combination) 

•  almost  25%  include  reported  exposure  to  nerve  gas 

•  14%  include  participants  that  were  wounded  in  combat 

•  12%  include  participants  reporting  some  form  of  pre-conflict  reproductive 
difficulties. 

This  is  somewhat  unusual  because  all  of  these  attributes  are  reported  relatively  infrequently  in  the 
CCEP  database  as  a  whole.  Mustard  gas  exposure  has  been  reported  by  2%  of  CCEP 
participants,  nerve  gas  6%,  wounded  in  combat  2%,  and  pre-conflict  reproductive  difficulties 
5.5%  (CCEP,  1996,  p.  19).  Finally,  the  combination  of  reported  nerve  gas  exposure  and  pre- 
conflict  reproductive  difficulties  occurs  in  9%  of  the  top  hypotheses.  Notably  scarce  are 
hypotheses  involving  actual  combat,  chemical  alarms,  scud  attacks,  race,  service,  or  post-conflict 
reproductive difficulties. It is surprising that, although pre- and post-conflict reproductive difficulties
are so highly statistically correlated, post-conflict reproductive difficulties do not appear in
any of the top hypotheses.

Similarly,  the  symptoms  bleeding  gums  and  weight  loss  are  each  included  in  over  50% 
of  the  hypotheses,  and  44%  of  the  hypotheses  involve  a  combination  of  both  bleeding  gums  and 
weight  loss.  Only  127  (or  1.6%)  of  the  participants  in  the  CCEP  database  subset  studied  (7746 
total  participants)  reported  that  specific  combination  of  symptoms.  It  is  extremely  interesting 
that  so  many  hypotheses  involve  bleeding  gums  and  weight  loss,  when  these  two  symptoms  are 
so  scarce  in  the  CCEP  database  at  large.  Also  noteworthy  is  the  large  number  of  hypotheses 
relating  reported  mustard  gas  exposure  to  bleeding  gums  and  weight  loss  (44%  of  hypotheses) 
and  nerve  gas  exposure  and  pre-conflict  reproductive  difficulties  with  bleeding  gums  (9%  of 
hypotheses). Notably scarce are hypotheses including joint pain, headaches,
and  fatigue,  the  symptoms  most  commonly  elicited  by  physicians  (CCEP,  1996,  p.  20). 

While  thesis  constraints  prohibit  discussing  all  100  of  the  highest  fitness  hypotheses, 
several are included to illustrate some of the correlations discovered (Figure #23).

Figure 23. Exposure-to-symptom Examples

The  hypothesis  in  Figure  #23(a)  is  included  to  demonstrate  that  DaMI,  without  the  aid 
of medical knowledge, will discover correlations that are intuitively obvious to a medical researcher.
This  hypothesis  states  that  70%  of  Navy  participants  who  report  exposure  to  diesel  fuel  and 
mustard  gas  also  complain  of  difficulty  breathing.  It  is  understandable  that  anyone  perceiving  an 
exposure  to  mustard  gas  and  who  works  with  diesel  fuel  may,  at  some  time,  have  suffered  from 
difficulty  breathing. 

In Figure #23(b), 21% of participants reporting exposure to nerve gas and pre-conflict
reproductive difficulties complain of both bleeding gums and muscle pain. Note that the fitness
measure in the analysis database (2.85) is very close to that of the verification database (2.43),
indicating that the hypothesis holds across independent samples of the entire CCEP database.
This hypothesis can be considered unexpected because this specific exposure combination is
reported by only 0.5% of the participants and the symptomatology by only 3.9%.
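
The behavior of an information-theoretic interest measure on a rule such as Figure #23(b) can be
illustrated with a short sketch. The code below computes the standard J-measure of Smyth and
Goodman, not DaMI's Modified J-measure, so its output is not comparable to the fitness values
2.85 and 2.43 quoted above. The function name j_measure and the use of the rounded percentages
from the text as probabilities are assumptions made for illustration only.

```python
import math


def j_measure(p_lhs, p_rhs, p_rhs_given_lhs):
    """Standard J-measure (in bits) of the rule 'if LHS then RHS'.

    This is the textbook form, not the Modified J-measure used by DaMI.
    """
    def term(posterior, prior):
        # posterior * log2(posterior / prior), with 0 * log(0) taken as 0
        if posterior == 0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    # Relative entropy between the RHS distribution given the LHS and the
    # unconditional RHS distribution.
    relative_entropy = (term(p_rhs_given_lhs, p_rhs)
                        + term(1 - p_rhs_given_lhs, 1 - p_rhs))
    # Weight by how often the LHS (the exposure combination) occurs.
    return p_lhs * relative_entropy


# Rounded figures quoted for Figure #23(b); treated here as probabilities.
p_exposure = 0.005   # nerve gas and pre-conflict reproductive difficulties
p_symptoms = 0.039   # bleeding gums and muscle pain
confidence = 0.21    # share of the exposed group reporting the symptoms

j = j_measure(p_exposure, p_symptoms, confidence)
print(f"J = {j:.5f} bits")
```

A rule whose symptoms are no more likely among the exposed group than in the database at large
scores zero; the score grows as the conditional rate departs from the base rate, weighted by how
often the exposure combination occurs.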

In Figure #23(c), 9% of participants reporting exposure to nerve gas and mustard gas complain
of both bleeding gums and weight loss. As before, the fitness measures of the analysis and
verification databases (2.77 and 2.41) are quite close. This hypothesis also holds in both
directions: 6% of participants reporting bleeding gums and weight loss reported exposure to
nerve gas and mustard gas. It is also considered unexpected because this specific exposure
combination is reported by only 1% of the participants and the symptomatology by only 1.6%.
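
The forward and reverse percentages quoted for Figure #23(c) can be checked against each other
with Bayes' rule. The sketch below is only a consistency check on the rounded figures in the
text; it is not part of DaMI, and the function name reverse_confidence is hypothetical.

```python
def reverse_confidence(p_lhs, p_rhs, p_rhs_given_lhs):
    """Return P(LHS | RHS) from P(RHS | LHS) using Bayes' rule."""
    return p_rhs_given_lhs * p_lhs / p_rhs


# Rounded figures quoted for Figure #23(c).
reverse = reverse_confidence(
    p_lhs=0.01,            # nerve gas and mustard gas exposure reported
    p_rhs=0.016,           # bleeding gums and weight loss reported
    p_rhs_given_lhs=0.09,  # forward direction of the hypothesis
)
print(f"{reverse:.0%}")  # prints 6%
```

The result, about 5.6%, rounds to the 6% reported for the reverse direction, so the quoted
percentages are mutually consistent.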

In  summation,  the  exposure-to-symptom  study  brings  to  light  several  correlations  which 
warrant further clinical analysis. Interest lies not only in the hypotheses themselves but also in
the  high  number  of  correlations  involving  rare  combinations  of  exposures  and  symptoms. 

D.       ARE  THE  RESULTS  USEFUL  TO  MEDICAL 
PROFESSIONALS? 

The results and research methodology of both the Exposure-to-diagnosis and Exposure-to-symptom
studies have been reviewed by Ph.D. epidemiologists on the CCEP staff and the Director of the
Deployment Surveillance Team. CCEP epidemiologists feel that DaMI has great potential for
"identifying previously unrecognized patterns of symptoms and diagnoses" (CCEP, Sep 1996).
They also agree that DaMI has already identified many associations in the CCEP database that
have not been found by conventional methods. However, they strongly emphasize that DaMI
result hypotheses must be subjected to more detailed, epidemiologically based post-processing
before they can be of practical use to the CCEP research effort. They recommend that future
DaMI research efforts be more closely coordinated with CCEP epidemiologists. The bottom line
is that the substantial potential of DaMI as a research tool has been recognized by the medical
researchers, and the research sponsor has directed that DaMI be included actively in the study of
Desert Storm Syndrome with the closer involvement of CCEP epidemiologists.

VII.  CONCLUSION

After many months of theoretical development, genetic algorithm design, and fine-tuning,
DaMI has accomplished its goal: to comprehensively search the CCEP Desert Storm
database  and  provide  medical  researchers  with  a  subset  of  several  thousand  hypotheses  for  further 
investigation  from  the  billions  of  possible  combinations.  DaMI  has  proven  its  ability  to  search 
an  extremely  large  unstructured  database  and  cull,  in  a  reasonable  amount  of  time,  a  subset  of  the 
highest  interest  rules  within  that  database.  DaMI  has  more  to  tell  us  about  the  CCEP  database,  as 
it  can  be  retuned  for  different  search  priorities  and  measures  of  interest.  It  may  also  be  applied  to 
any  number  of  similar  bodies  of  medical  and  non-medical  data. 

This  research  began  with  a  formidable  analysis  problem  and  an  idea  that  the  usefulness 
of  computer  analysis  could  extend  beyond  the  conventional  paradigm  of  "number  crunching." 
The author believed that a genetic algorithm, given a model of a human researcher's interest,
could intelligently attack a tremendous search problem and reduce it to a manageable size,
given limited resources. We have taken a complex research
question  and  unstructured  database  and  formulated  both  into  a  workable  representation  of 
researcher  interest  and  usable  source  of  study.  A  genetic  algorithm  (DaMI)  has  been  created 
which  can  perform  a  self-adapting,  intelligent  search  with  striking  results.  In  short,  DaMI  has 
achieved  our  vision  and  exceeded  our  wildest  expectations.  This  thesis  has  shown  only  one 
venture  into  this  new  realm  of  medical  research,  pre-emptive  employment  of  genetic  algorithm 
analysis;  there  are  certainly  many  more  adventures  awaiting. 

A.       LESSONS  LEARNED 

The author encountered few problems during this thesis process. This thesis involves a
very high-visibility and politically sensitive subject, Desert Storm Syndrome. As such, there
were numerous requirements for presentations and progress meetings in addition to the normal
research challenges. Since the political obligations were linked to the feedback from the
sponsoring agency, they could not be ignored; this placed a very high time demand on the author.
Also, the sponsoring agency is located in Washington, D.C., so a great deal of travel and remote
communication was required to ensure adequate project coordination. Finally, feedback from
medical researchers in the field was very difficult to obtain because of their diverse geographic
locations and limited availability.

The  author  has  learned  several  valuable  lessons  from  the  thesis  process: 

•  When  doing  a  thesis  involving  data  analysis,  do  not  wait  for  results  to  start  writing  the  thesis. 
A  great  deal  of  the  thesis  itself  describes  the  theoretical  basis  and  methodology  of  the 
research,  and  therefore,  can  be  written  before  final  results  are  achieved.  The  pressure  of 
"doing  the  write-up"  is  a  serious  burden  to  good  analysis  and  writing  early  helps  to  alleviate 
that  pressure. 

•  If  the  thesis  is  directly  funded  by  an  outside  agency  (in  my  case  the  CCEP),  it  is  important  to 
clearly  identify  a  liaison  at  that  agency.  In  my  case,  there  was  not  a  clear  procedure  for 
information  exchange  established  during  the  first  half  of  the  project,  which  made 
coordination  haphazard.  Once  a  clear  coordination  mechanism  was  put  in  place,  the  thesis 
process  became  much  smoother. 

•  It  is  critical  that  a  researcher  have  a  sounding  board  who  is  not  directly  attached  to  the 
research. It was very easy for me to become so engrossed in the problem that I began
missing glaring solutions. I was lucky to have a single individual (not a genetic algorithm or
medical expert per se) who reality-checked my research and reviewed my thesis throughout
my  research.  This  feedback  has  proven  invaluable  to  the  quality  of  my  thesis  and  the  success 
of  my  research. 


B.  RECOMMENDATIONS  FOR  FUTURE  RESEARCH 

The  success  of  DaMI  opens  the  door  to  countless  opportunities  for  future  research.  Two 
areas of study remain to be explored in the CCEP database:

•  Analysis of demographic/exposure information and a restructured diagnosis set. Efforts are
currently  underway  to  regroup  participant  diagnosis  information  so  that  similar 
diagnoses  (even  those  with  vastly  divergent  ICD  codes)  are  grouped  together.  This 
will  allow  DaMI  to  analyze  a  majority  of  diagnoses,  as  opposed  to  the  top  21 
diagnoses  as  presented  in  this  thesis. 

•  Analysis  of  time/motion  study  of  units  and  their  locations  during  the  Persian  Gulf 
Conflict. Since in many cases units are homogeneous in location and therefore in
exposure  to  health  risks,  an  analysis  of  the  CCEP  participants'  unit  location  in  time 
and  associated  symptoms  and/or  diagnoses  should  prove  quite  fruitful. 

It  should  be  obvious  that  DaMI  has  not  been  created  with  the  sole  intent  of  searching  for 
a  Desert  Storm  Syndrome.  It  is  applicable  to  many  other  large,  unstructured  databases  of 
medical  and  non-medical  data.  Aside  from  examining  other  bodies  of  data,  there  are  several 
areas  to  investigate  concerning  DaMI  itself: 

•  Comparison  of  DaMI  performance  with  other  commercial  data  mining  software  and 
other  data  mining  techniques  (like  regression  analysis,  cluster  analysis,  and  neural 
networks). 

•  Modification  of  DaMI's  statistical  package  to  use  alternative  fitness  functions,  such 
as  Chi-square  instead  of  just  the  Modified  J-measure. 

•  Enhancement of the DaMI genetic algorithm to utilize parallel processing for
statistical computations. A single PC is clearly less efficient than a group of PC
nodes operating simultaneously. Distributing the computations should substantially
increase search speed without increasing the complexity of the computer hardware required.

•  Rewriting of the DaMI code in C++ or Ada so that it can run on a higher-capacity
computer platform. This should increase efficiency but will make the algorithm more
restrictive (less portable) in terms of operating platforms.
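
As an illustration of the alternative fitness function suggested above, the following sketch
computes the Pearson chi-square statistic for a single hypothesis from four counts. It is not
taken from the DaMI code; the function name chi_square_2x2 and its arguments are hypothetical,
and the example counts are rough values derived from the rounded percentages quoted for
Figure #23(c), not figures read from the CCEP database.

```python
def chi_square_2x2(n_total, n_lhs, n_rhs, n_both):
    """Pearson chi-square statistic (1 degree of freedom) for a rule LHS -> RHS."""
    # Observed cells of the 2x2 contingency table.
    a = n_both                              # LHS true, RHS true
    b = n_lhs - n_both                      # LHS true, RHS false
    c = n_rhs - n_both                      # LHS false, RHS true
    d = n_total - n_lhs - n_rhs + n_both    # LHS false, RHS false
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 0.0  # degenerate table: no evidence of association
    return n_total * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])


# Approximate counts (assumed): 7746 participants, about 1% reporting the
# exposure combination, 127 reporting the symptoms, about 9% of the exposed
# group reporting both.
stat = chi_square_2x2(n_total=7746, n_lhs=77, n_rhs=127, n_both=7)
print(f"chi-square = {stat:.2f}")
```

With counts this sparse, the expected number of participants in the "both" cell is well below
five, so the chi-square approximation is unreliable and an exact test would be the safer choice
for ranking rare hypotheses.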

APPENDIX  A.  CCEP  DATA  DICTIONARIES  AND  DATA
COLLECTION  METHODOLOGY


A.       DATA  DICTIONARY  OF  CCEP  DATABASE 
No.  Field Name   Type          Width  Use          Remarks                   Action
12   SMOKE_NOW    Text                 attribute    has U's
13   NM_CG_NOW    Text          3      attribute ?
14   SMOKE_PAST   Text                 attribute    has U's
15   NM_CG_PAST   Text          3      attribute ?
16   OIL_SMOKE    Text                 attribute    has U's
17   HEAT_SMOKE   Text                 attribute    has U's
18   PASS_SMOKE   Text                 attribute    has U's
19   DIESL_FUEL   Text                 attribute    has U's
20   CARC_PAINT   Text                 attribute    has U's
21   OTHR_PAINT   Text                 attribute    has U's
22   OTHR_SOLVE   Text                 attribute    has U's
23   URANIUM      Text                 attribute    has U's
24   MICROWAVES   Text                 attribute    has U's
25   PESTICIDES   Text                 attribute    has U's
26   NERVE_GAS    Text                 attribute    has U's
27   PYRIDOSTIG   Text                 attribute    has U's
28   MUSTRD_GAS   Text                 attribute    has U's
29   CONTM_FOOD   Text                 attribute    has U's
30   CONTM_WATR   Text                 attribute    has U's
31   NONAF_WATR   Text                 attribute    has U's
32   NONAF_FOOD   Text                 attribute    has U's
33   ANTHRAX      Text                 attribute    has U's
34   BOTULISM     Text                 attribute    has U's
35   MALARIA      Text                 attribute    has U's
36   OTHER_EXP1   Text          35     attribute    has U's
37   OTHER_EXP2   Text          35     attribute    has U's
38   OTHER_EXP3   Text          35     attribute    has U's
39   ACT_COMBAT   Text          1      attribute    has U's
40   WOUNDED      Text          1      attribute    has U's
41   CASUALTIES   Text          1      attribute    has U's
42   SCUD_ATTAC   Text          1      attribute    has U's
43   CHEM_ALARM   Text          1      attribute    has U's
44   PQ_CHD_P     Number (Dou   8      attribute
45   PQ_CHD_A     Number (Dou   8      attribute
46   PQ_INF_P     Text          1      attribute    combine into single field
47   PQ_INF_A     Text          1      attribute
48   PQ_MIS_P     Number (Dou   8      attribute
49   PQ_MIS_A     Number (Dou   8      attribute
50   PQ_SB_P      Number (Dou   8      attribute
51   PQ_SB_A      Number (Dou   8      attribute
52   PQ_ID_P      Number (Dou   8      attribute
53   PQ_ID_A      Number (Dou   8      attribute
54   PQ_DEF_P     Number (Dou   8      attribute
55   PQ_DEF_A     Number (Dou   8      attribute    combine into single field
56   SPON_LNAME   Text          20     no           privacy act               delete
57   SPON_FNAME   Text          11     no           privacy act               delete
58   SPON_MNAME   Text          11     no           privacy act               delete
59   SEX          Text          1      demographic  blanks
60   RACE         Text          1      demographic  blanks
61   MAR_STATUS   Text          1      demographic  blanks
62   DUTY_STAT    Text          6      attribute    don't know code
63   MOS_NEC_AF   Text          7      attribute    blanks (not too many)
64   LOST_WORK    Number (Dou   8      maybe        question info value       LOFR
65   CHIEF_COMP   Text          35     no           text                      delete
66   CHIEF_DTE    Date/Time     8      attribute ?  question info value       LOFR
67   CHIEF_DURA   Number (Dou   8      no           different for diff diags  delete
68   FATIG_DTE    Date/Time     8      maybe        question info value       LOFR
69   FATIG_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
70   ABDOM_DTE    Date/Time     8      maybe        question info value       LOFR
71   ABDOM_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
72   BLEED_DTE    Date/Time     8      maybe        question info value       LOFR
73   BLEED_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
74   DEPRE_DTE    Date/Time     8      maybe        question info value       LOFR
75   DEPRE_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
76   DIARR_DTE    Date/Time     8      maybe        question info value       LOFR
77   DIARR_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
78   DIFFI_DTE    Date/Time     8      maybe        question info value       LOFR
79   DIFFI_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
80   SHORT_DTE    Date/Time     8      maybe        question info value       LOFR
81   SHORT_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
82   HAIRL_DTE    Date/Time     8      maybe        question info value       LOFR
83   HAIRL_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
84   HEADA_DTE    Date/Time     8      maybe        question info value       LOFR
85   HEADA_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
86   JOINT_DTE    Date/Time     8      maybe        question info value       LOFR
87   JOINT_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
88   MEMOR_DTE    Date/Time     8      maybe        question info value       LOFR
89   MEMOR_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
90   MUSCL_DTE    Date/Time     8      maybe        question info value       LOFR
91   MUSCL_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
92   RASH_DTE     Date/Time     8      maybe        question info value       LOFR
93   RASH_DURA    Number (Dou   8      attribute    number confuses algo      yes/no
94   SLEEP_DTE    Date/Time     8      maybe        question info value       LOFR
95   SLEEP_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
96   WEIGH_DTE    Date/Time     8      maybe        question info value       LOFR
97   WEIGH_DURA   Number (Dou   8      attribute    number confuses algo      yes/no
98   OTHR1_COMP   Text          20     no           can't correlate text      delete
99   OTHR1_DTE    Date/Time     8      no           can't correlate text      delete
100  OTHR1_DURA   Number (Dou   8      no           can't correlate text      delete
101  OTHR2_COMP   Text          20     no           can't correlate text      delete
102  OTHR2_DTE    Date/Time     8      no           can't correlate text      delete
103  OTHR2_DURA   Number (Dou   8      no           can't correlate text      delete
104  OTHR3_COMP   Text          20     no           can't correlate text      delete
105  OTHR3_DTE    Date/Time     8      no           can't correlate text      delete
106  OTHR3_DURA   Number (Dou   8      no           can't correlate text      delete
107  OTHR4_COMP   Text          20     no           can't correlate text      delete
108  OTHR4_DTE    Date/Time     8      no           can't correlate text      delete
109  OTHR4_DURA   Number (Dou   8      no           can't correlate text      delete
110  PRI_DIAG     Text          40     no           text                      delete
111  PRI_ICD      Text          6      RHS
112  SEC_DIAG1    Text          40     no           text                      delete
113  SEC_ICD1     Text          6      RHS          blanks
114  SEC_DIAG2    Text          40     no           text                      delete
115  SEC_ICD2     Text          6      RHS          blanks
116  SEC_DIAG3    Text          40     no           text                      delete
117  SEC_ICD3     Text          6      RHS          blanks
118  SEC_DIAG4    Text          40     no           text                      delete
119  SEC_ICD4     Text          6      RHS          blanks
120  SEC_DIAG5    Text          40     no           text                      delete
121  SEC_ICD5     Text          6      RHS          blanks
122  SEC_DIAG6    Text          40     no           text                      delete
123  SEC_ICD6     Text          6      RHS          blanks
124  ALLER_CONS   Text                 no           question info value       delete
125  AUDIO_CONS   Text                 no           question info value       delete
126  CARDI_CONS   Text                 no           question info value       delete
127  DENTL_CONS   Text                 no           question info value       delete
128  DERMA_CONS   Text                 no           question info value       delete
129  EARNT_CONS   Text                 no           question info value       delete
130  ENDOC_CONS   Text                 no           question info value       delete
131  GASTR_CONS   Text                 no           question info value       delete
132  HEMAT_CONS   Text                 no           question info value       delete
133  INFEC_CONS   Text                 no           question info value       delete
134  NEPHR_CONS   Text                 no           question info value       delete
135  NEURO_CONS   Text                 no           question info value       delete
136  OCCUP_CONS   Text                 no           question info value       delete
137  PULMO_CONS   Text                 no           question info value       delete
138  PSYCH_CONS   Text                 no           question info value       delete
139  PTEST_CONS   Text                 no           question info value       delete
140  RHEUM_CONS   Text                 no           question info value       delete
141  MOVE_ON      Text                 no           question info value       delete
142  DIAG_DTE     Date/Time     8      no           question info value       delete
143  DIAG_DONE    Text                 no           question info value       delete
144  PTQS_DONE    Text                 no           question info value       delete
145  PRQS_DONE    Text                 no           question info value       delete
146  IREL_DONE    Text                 no           question info value       delete
147  DECL_DONE    Text                 no           question info value       delete
148  HOME_ADDR1   Text          30     no           privacy act               delete
149  HOME_ADDR2   Text          30     no           privacy act               delete
150  HOME_TOWN    Text          20     no           privacy act               delete
151  HOME_STATE   Text          2      demographic
152  HOME_ZIP     Text          5      no           info too specific         delete
153  WORK_PHONE   Text          12     no           privacy act               delete
154  HOME_PHONE   Text          12     no           privacy act               delete
155  DCFORM_DTE   Date/Time     8      no           no info value             delete
156  STARTLATER   Text                 no           no info value             delete
157  WHENTOCALL   Text          15     no           no info value             delete
158  DECLINE      Text                 no           no info value             delete
159  WITHDRAW     Text                 no           no info value             delete
160  EVAL_COMP    Text                 no           no info value             delete
161  SATISFIED    Text                 attribute ?  question info value
162  PQ_DATE      Date/Time     8      no           no info value             delete
163  PQ_EVALDTE   Date/Time     8      no           no info value             delete
164  MIL_ADDR1    Text          30     no           no info value             delete
165  MIL_ADDR2    Text          30     no           no info value             delete
166  MIL_STATE    Text          2      no           no info value             delete
167  MIL_ZIP      Text          5      no           no info value             delete
168  CHECKL_DTE   Date/Time     8      no           no info value             delete
169  REPORT_DTE   Date/Time     8      no           no info value             delete
170  REPORT_TIM   Text          8      no           no info value             delete
171  PRIOR_JAN    Text                 no           no info value             delete
172  REFUSED      Text                 no           no info value             delete
173  NEGLECTED    Text                 no           no info value             delete
174  EDS_VIEWED   Yes/No               no           no info value             delete
175  DCF_MISSIN   Text                 no           no info value             delete
176  UIC          Text          8      attribute
177  PHASE        Text          1      no           no info value             delete
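
The remarks in the data dictionary above imply a simple preprocessing pass: fields marked
"delete" are dropped, and symptom duration fields marked "yes/no" are reduced to a flag. The
sketch below shows one way such a pass could look. It is not the DaMI preprocessing code; the
function name, the subset of dropped fields, the "Y"/"N" encoding, and the rule that any positive
duration counts as "yes" are all assumptions made for illustration.

```python
# A small, illustrative subset of the fields marked "delete" in the dictionary.
DROP_FIELDS = {"SPON_LNAME", "SPON_FNAME", "SPON_MNAME", "HOME_PHONE", "CHIEF_DURA"}


def preprocess(record):
    """Return a cleaned copy of one participant record (a dict of field values)."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # privacy-act or low-value field: leave it out
        if field.endswith("_DURA"):
            # Assumption: any positive duration means the symptom was reported.
            cleaned[field] = "Y" if value is not None and value > 0 else "N"
        else:
            cleaned[field] = value
    return cleaned


# Hypothetical record, not taken from the CCEP database.
sample = {
    "SPON_LNAME": "DOE",
    "NERVE_GAS": "Y",
    "BLEED_DURA": 14.0,
    "WEIGH_DURA": 0.0,
    "CHIEF_DURA": 30.0,
}
result = preprocess(sample)
print(result)
```

Reducing durations to flags gives every symptom attribute the same small set of values as the
exposure attributes, which is consistent with the remark that raw numbers confuse the algorithm.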

B.       DATA  COLLECTION  METHODS

This  section  is  quoted  directly  from  (CCEP,  1996,  pp.  13-14).

Participants  may  enroll  in  the  CCEP  by  calling  a  toll-free  number  (1-800-796-9699), 
which  provides  information  and  referrals  to  individuals  requesting  medical  evaluations  or  by 
contacting  their  local  military  medical  treatment  facility  (MTF).  All  MHSS  eligible  beneficiaries 
are  eligible  for  the  CCEP.  For  eligibility  in  the  CCEP,  a  PGW  veteran  (or  dependent)  must  have 
been  eligible  for  DoD  health  care  in  June  1994  or  later. 

Once  an  individual  is  referred,  the  CCEP  provides  a  two-phase,  comprehensive  medical 
evaluation,  with  Phase  I  being  conducted  at  one  of  184  local  MTFs.  Phase  II  (when  required)  is 
conducted  at  one  of  14  regional  medical  centers  (RMCs).  The  medical  review  includes  questions 
about  family  history,  health,  occupation,  and  unique  exposures  in  the  Gulf  War,  as  well  as  a 
structured  review  of  symptoms. 

Once  a  participant  has  completed  the  examination  processes,  copies  of  examination 
results  are  forwarded  to  the  CCEP  Program  Management  Team  (PMT),  where  they  undergo 
quality  assurance  procedures,  and  the  data  are  entered  into  the  master  CCEP  database. 

Additionally,  of  those  CCEP  participants  suffering  chronic,  debilitating  symptoms,  the 
DoD  has  established  an  SCC  at  Walter  Reed  Army  Medical  Center  and  will  have  a  second  center 
opening in mid 1996 at Wilford Hall Medical Center, Lackland AFB, Texas.

The  data,  which  were  initially  entered  into  a  relational  database,  were  translated  into  a 
statistical  format  for  this  (CCEP  Report  on  18,598  Participants)  report.  Various  validity  checks 
were  conducted  to  ensure  that  the  data  were  appropriated  for  interpretation.  Statistical  tests  and 
descriptive  analyses  were  conducted  on  various  categories  of  participants,  including  those  in 
theater  during  the  Persian  Gulf  War,  their  spouses,  and  their  children.  Moreover,  the  CCEP 
participants  who  were  in  theater  were  compared  to  the  PGW  population  as  a  whole  and  were 
stratified  by  units  to  compare  those  units  with  higher  CCEP  participation  to  those  units  with 
lower  CCEP  participation.  Specific  analyses  concerning  self-reported  exposures,  physician- 
elicited  symptoms,  diagnoses,  self-reported  reproductive  outcomes,  self-reported  lost  workdays, 
physical  evaluation  boards  (PEBs),  and  program  satisfaction  were  conducted.  Additionally,  a 
comparative  analysis  with  the  NAMCS  data  was  conducted  using  age,  sex,  race,  ethnicity,  and
diagnostic  code  variables  to  more  closely  match  the  CCEP  population. 

APPENDIX  B.  DATA  DICTIONARY  OF  SELECTED  DaMI  FILES

Structure for table:      C:\RESEARCH\VFP\VFPDOCS\DAMISAMP.DBF
Number of data records:
Date of last update:      08/04/96
Code Page:                1252

Field  Field Name   Type       Width  Dec  Index  Nulls
1      RULE         Integer    4                  No
2      CF           Numeric    6      2           No
3      CUMCF        Numeric    6      2           No
4      GENERATN     Integer    4                  No
5      SERVICE      Character  3                  No
6      SMOKE_NOW    Character  3                  No
7      SMOKE_PAST   Character  3                  No
8      OIL_SMOKE    Character  3                  No
9      HEAT_SMOKE   Character  3                  No
10     PASS_SMOKE   Character  3                  No
11     DIESL_FUEL   Character  3                  No
12     CARC_PAINT   Character  3                  No
13     OTHR_PAINT   Character  3                  No
14     OTHR_SOLVE   Character  3                  No
15     URANIUM      Character  3                  No
16     MICROWAVES   Character  3                  No
17     PESTICIDES   Character  3                  No
18     NERVE_GAS    Character  3                  No
19     PYRIDOSTIG   Character  3                  No
20     MUSTRD_GAS   Character  3                  No
21     CONTM_FOOD   Character  3                  No
22     CONTM_WATR   Character  3                  No
23     NONAF_WATR   Character  3                  No
24     NONAF_FOOD   Character  3                  No
25     ANTHRAX      Character  3                  No
26     BOTULISM     Character  3                  No
27     MALARIA      Character  3                  No
28     ACT_COMBAT   Character  3                  No
29     WOUNDED      Character  3                  No
30     CASUALTIES   Character  3                  No
31     SCUD_ATTAC   Character  3                  No
32     CHEM_ALARM   Character  3                  No
33     PQ_PRIOR     Character  3                  No
34     PQ_AFTER     Character  3                  No
35-51  (17 fields, names not listed)
                    Character  3
       Total                   162


Structure for table:      C:\RESEARCH\VFP\VFPDOCS\RULELIB.DBF
Number of data records:   5446
Date of last update:      08/04/96
Code Page:                1252

Field  Field Name   Type       Width  Dec  Index  Nulls
1      RULE_NUMBE   Numeric    8                  No
2      NO_TRUE_LH   Numeric    8                  No
3      NO_TRUE_RH   Numeric    8                  No
4      NO_TRUE_BO   Numeric    8                  No
5      NO_FALSE_B   Numeric    8                  No
6      STANDARD_C   Numeric    5      2           No
7      REVERSE_CF   Numeric    5      2           No
8      COMPLEX_CF   Numeric    5      2           No
9      VCOMPLEX     Numeric    5      2           No
10     LHS_TEXT     Character  100                No
11     RHS_TEXT     Character  100                No
12     RHS_VERB     Character  150                No
13     REF_NUM      Integer    4                  No
       Total                   415

Index: one field carries a descending index (Desc); the listing does not show which.
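
The count and confidence fields in RULELIB.DBF suggest how a stored rule might be summarized.
The sketch below assumes that NO_TRUE_LH, NO_TRUE_RH, and NO_TRUE_BO hold the number of records
matching the left-hand side, the right-hand side, and both, and that STANDARD_C and REVERSE_CF
are the corresponding conditional proportions rounded to two decimals. That reading of the field
names is an assumption, not something the data dictionary states, and the example counts are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuleCounts:
    """Counts for one rule; attribute names mirror the RULELIB field names."""
    no_true_lh: int  # records where the left-hand side holds (assumed meaning)
    no_true_rh: int  # records where the right-hand side holds (assumed meaning)
    no_true_bo: int  # records where both sides hold (assumed meaning)

    def standard_cf(self):
        """Share of LHS records that also satisfy the RHS, to two decimals."""
        if self.no_true_lh == 0:
            return 0.0
        return round(self.no_true_bo / self.no_true_lh, 2)

    def reverse_cf(self):
        """Share of RHS records that also satisfy the LHS, to two decimals."""
        if self.no_true_rh == 0:
            return 0.0
        return round(self.no_true_bo / self.no_true_rh, 2)


# Hypothetical counts chosen to resemble the proportions of Figure #23(c).
rule = RuleCounts(no_true_lh=77, no_true_rh=127, no_true_bo=7)
print(rule.standard_cf(), rule.reverse_cf())  # prints 0.09 0.06
```

Storing the raw counts alongside the derived proportions allows either direction of a rule to be
recomputed without rescanning the participant records.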


APPENDIX  C.  TOP  100  HYPOTHESES  DISCOVERED  BY
EXPOSURE-TO-DIAGNOSIS  AND  EXPOSURE-TO-SYMPTOM
STUDIES